---
output: 
  pdf_document:
    keep_tex: yes
    fig_caption: yes
    latex_engine: pdflatex
indent: yes
urlcolor: blue
header-includes: \usepackage{float} \usepackage{graphicx} \usepackage{longtable} \usepackage{rotating}
  \usepackage{hhline} \usepackage{dcolumn} \usepackage{setspace}
title: "Ideological Extremification in the American Media Diet"
thanks: "Replication files are available at https://github.com/matthewdardet/GOV2001ReplicationPaper (GitHub) or https://doi.org/10.7910/DVN/AGIAZ4 (Dataverse). We thank Soubhik Barari, Chris Kenny, Gary King, and members of Ryan Enos' research group for their useful insights."
author:
- Matthew E. Dardet^[Department of Government, Harvard University, matthewdardet@fas.harvard.edu]
- Ben TerMaat^[Department of Government, Harvard University, btermaat@g.harvard.edu]
abstract: "Replicating Guess (2021) and Tyler, Grimmer, and Iyengar (2021), we find substantial overlap in the online media diets of Democrats and Republicans in 2015 and moderately less overlap in 2016, important findings given the contemporary concern over polarization driven by media echo chambers on the Internet. However, we also find that models of media consumption that account for partisanship alone without considering ideology exhibit signs of model dependence. Including ideology in these models reduces model dependence and explains greater variance in Americans' media diets. Futher, stratifying respondents by ideological type (i.e., if they are ideologues, near ideologues, moderate partisans, true non-partisan moderates, or have no issue content at all in their beliefs) reveals that more ideological respondents are exhibiting less overlap in their online media consumption and are moving closer to the echo chamber model of media consumption. Finally, we present evidence that different methods for classifying news articles from URLs and different survey samples cause differing results in the two papers."
keywords: "media diets, motivated reasoning, political polarization, machine learning"
date: "`r format(Sys.time(), '%d %B %Y')`"
geometry: margin=1in
fontsize: 12pt
spacing: double
bibliography: references.bib
nocite: '@*'
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.pos = "!H", out.extra = "")
options(xtable.comment = FALSE)

# rm(list = ls())

# Load standard package suite. 
library(easypackages)
packages <- c("bayestestR", "car", "caret", "caTools", "conformalInference", 
              "data.table", "DeclareDesign", "devtools", "ebal", "estimatr", "extrafont", 
              "fixest", "foreign", "ggpubr", "ggridges", "ggthemes", "glmnet", 
              "gtsummary", "haven", "hrbrthemes", "here", "kableExtra", "lars", "lfe", 
              "lmtest", "Matching", "matchMulti", "margins", "mlr3", "modeest", 
              "modelsummary", "mvtnorm", "overlap", "papeR", "plotROC", "randomForest",
              "randomizr", "redist", "rgenoud","ri2", "rio", "rjson", "RobustSE", 
              "rticles", "rvest", "sandwich", "scales", "sf", 
              "spatstat", "stargazer", "tidyverse", "urltools", "xtable")

devtools::install_github(repo="ryantibs/conformal", subdir="conformalInference")

if(length(which(!packages %in% installed.packages())) > 0) {
  install.packages(packages[!packages %in% installed.packages()])
}

libraries(packages)

# Set working directory. 
setwd(here::here())

# Load helper functions. 
source("scripts/functions.R") 

# Read datasets. 
yg15 <- read_csv("input/Guess_2021/data2015.csv")
yg16 <- read_csv("input/Guess_2021/data2016.csv")
yg16mob <- read_csv("input/Guess_2021/data2016mobile.csv")
```

# Introduction

\doublespacing

One of the most important agendas for modern political science research involves quantifying and elucidating the causes of apparent increases in common measures and varieties of political polarization. In recent years, the literature has benefited from numerous studies that examine the relationship between heightened political polarization and digital technologies. However, stark disagreement remains regarding the fundamental question of whether Internet media and news contribute to political polarization. While some scholars conclude that social media has created polarizing echo chambers (Sunstein, 2001, 2017) or information silos (Pariser, 2011; Shmargad & Klar, 2020), others argue that such concerns are overblown (Boxell et al., 2017; Guess, 2021).  

In this study, we replicate two papers on opposing sides of this online media-driven polarization debate. The first paper, Guess (2021) "(Almost) Everything in Moderation: New Evidence on Americans' Online Media Diets," finds that "most people across the political spectrum have relatively moderate media diets, about a quarter of which consist of mainstream news websites and portals," (p. 1007). Guess (2021) reaches this finding by fielding opt-in YouGov Pulse surveys of respondents in 2015 ($N = 1,392$) and 2016 ($N = 2,512$), asking these respondents about their demographic characteristics, party identification, and ideological leanings. These respondents then agree to download a tool called the Wakoopa browser bar, which subsequently tracks and records all of the URLs that they view online. 

Guess (2021) employs a machine learning algorithm, logistic regression classification, to determine which URLs constitute news and which constitute non-news browsing behavior. Using a mapping of URLs to ideological slants developed by Bakshy et al. (2015), a mapping which relies on the self-reported ideologies of the Facebook users who visit these sites, Guess (2021) computes a simple mean ideological slant for the entire media diet (comprising all of the URLs visited) of each respondent. Ultimately, by calculating a measure of distributional overlap that we describe in the analysis below, Guess (2021) is able to determine the degree to which media diets of Republicans, Democrats, and Independents are overlapping (in other words, "similar" or "moderate"). 

The second paper, Tyler, Grimmer, & Iyengar (2021), follows a very similar analytical scheme but comes to the conclusion that Democrats and Republicans are more siloed in their media consumption habits than were those in Guess' (2021) findings. That is, Democrats and Republicans respectively consume left and right-leaning news articles on moderate and mainstream media sources. 

Two key differences exist in the data acquisition and methodological strategies of each study. The first difference is one of survey construction and sampling. Whereas Guess (2021) uses the same YouGov Pulse web browsing panel in 2015 and 2016, Tyler, Grimmer, & Iyengar (2021) fields a different YouGov panel survey in 2016 of $N = 7704$ respondents, out of which $1076$ or $14\%$ of respondents agreed to download the Wakoopa browsing history tracker. 

The second difference exists in how each paper classifies URL visits as "news" or as "not news." Whereas Guess (2021) learns a logistic regression classifier on document-feature matrices from a training set of news articles pre-coded as news/not news, Tyler, Grimmer, & Iyengar (2021) uses URL section-name proxies to get at the same classification. The classification procedure in Tyler, Grimmer, & Iyengar (2021) involves subsetting URLs to their second-level domain names (e.g, wikipedia.org, foxnews.com, etc.) and removing all of the URLs that are not on a pre-specified list of popular news domains. Then, the group focuses on explicitly U.S.-election-related news content within these news sites using a pre-specified vector of terms related to the 2016 election (e.g., Trump, Clinton, politics, Congress). 

In this replication paper, we seek to analyze these disparate URL classification strategies in Guess (2021) and Tyler, Grimmer, & Iyengar (2021) as well as the overall models used in each paper to predict Americans' media consumption. Along the way, we introduce a simple "bellwether" statistic for detecting reductions in the magnitude of model dependence in parametric linear regression models, the mean absolute uncertainty deviation (MAUD) and its standardized version (SMAUD). 

Ultimately, we come to three main findings. First, models that account for ideology in addition to party identification appear to be less model-dependent (when ideology is being properly captured by the URL classification strategy) than models that consider party identification alone. This finding has implications for the study of polarization as manifested in media diet consumption patterns. Should political scientists care more about the overall diets of Democrats and Republicans, or should they care more about the diets of ideologues within society?

The answer may be both, and coincides with a literature in political science that affirms the distinction between partisanship and ideology. Abramowitz & Saunders (2006) find that partisanship comprises elements of both traditional social identity (i.e., identifying with a party in order to satisfy an emotional attachment to group loyalty) and ideology. Green, Palmquist, and Shickler (2002) assert that party identification is influenced more by identification with social groups as opposed to a rational choice examination of party ideology or policies. They also contend that party loyalties themselves motivated citizens' issue positions, evaluations of political leaders, and vote choices rather than the other way around. 

Second, we find that, by stratifying respondents by ideological types similar to those outlined in Converse's (1964) seminal work, "The Nature of Belief Systems in Mass Publics," (which first sorted the mass public into the ideologues, near ideologues, group interest, nature of the times, and no issue content categories), the media diets of more ideological citizens have been becoming much more extreme over time, whereas those of the "moderate majority" have remained largely stable over time.

Third, using simple correlations between ideology and party identification in each study as well as more advanced, machine-learning-based strategies that implement Lei et al.'s (2018) leave-one-covariate-out (LOCO) inference, we show that the URL section name proxy method seems to classify news/not news differently than the logistic regression classification method. This in combination with differential survey samples likely account for the difference in results exhibited by both papers.

# Data and Research Design

Data for these analyses are national survey responses connected to URL browsing histories for each respondent generated by the Wakoopa web browsing tracker bar and come from Guess (2021) and Tyler, Grimmer, and Iyengar (2021), respectively. Data for this replication were limited by the fact that the URL data could contain private and sensitive information, and we were only able to obtain a cleaned version of the 2015 URL browsing history data from Guess (2021). 

As overviewed in the introduction, the surveys fielded by the two papers are distinct. The 2015 survey in Guess (2021) contained $N = 1,392$ respondents from the YouGov Pulse panel with $6,319,441$ URL observations collected between February 27 and March 19 of 2015, and the 2016 survey from the same paper was from another YouGov Pulse panel with $N = 2,512$ respondents and $16,984,989$ URL visits collected between October 7 and October 31 of 2016. Tyler, Grimmer, and Iyengar (2021) construct a 2016 YouGov panel (seemingly different from the company's Pulse panel offerings) using 7,704 respondents, out of which $N = 1,076$ or $14\%$ of respondents agreed to download the Wakoopa browsing history tracker that recorded all of the URLs they visited on their computers. One key question this study seeks to shed light on is that of whether it is the different survey composition or the URL classification method itself that could be driving these disparities in results. 

In the analysis that follows, we compute the average ideological slant of Americans' media diets as in Guess (2021). Then, we show that the linear models in Guess (2021) suffer from model dependence and that they can be improved by including respondent ideology as a parameter in the main model specifications. Lastly, using LOCO inference, we show that the URL section name proxy method seems to classify news/not news differently than the logistic regression classification method. This in combination with differential survey samples likely account for the difference in results exhibited by both papers. 

# Quantifying the American Media Diet

Are most Americans consuming the same sorts of online media, or are U.S. citizens polarized in their consumption patterns? In \textbf{Figure 1}, we use survey and URL browsing history data from Guess (2021) to plot the distributions of Americans' average media diet slants by party identification. Average media diet slant is computed by classifying respondent URL browsing histories as news/not news using a logistic regression classification algorithm as outlined above, then using the alignment scores from Bakshy et al. (2015) to classify the ideological leaning of a news article visit. Then, a simple mean is taken for each respondent. The figure displays distributions of average media diet slant in the years 2015 and 2016, modified by YouGov survey weights, of respondents using URLs that consist of news and politics articles only as well as those that explicitly remove portal sites like AOL or Yahoo.

\textbf{Figure 1} shows that most Americans, regardless of party identification, are very much overlapping in the average ideological slant of their media diets. In other words, most Americans of differing party affiliations seem to be mostly moderate and similar in their news consumption, although less distributional overlap occurs in 2016.

```{r, echo = FALSE, message = FALSE, warning = FALSE}
# Figure 1: Partisan Slant Distributions Using Guess (2021) Data and Methodology.

# 2015 distributions (portals and no portals).
plt.15.p <- yg15 %>% 
  filter(pid3 != "Other/Don't know") %>% 
  ggplot(aes(diet_mean_newspol, fill = pid3, color = pid3, weight = 
               weights/sum(weights))) + 
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.75, y = 0.5, 
           label = "Republicans", size = 3, color = "firebrick1") +
  annotate(geom = "text", x = 0.2, y = 1.75, 
           label = "Independents", size = 3, color = "black") +
  annotate(geom = "text", x = -0.8, y = 1, 
           label = "Democrats", size = 3, color = "dodgerblue4") +
  scale_fill_grey() + 
  scale_color_manual(values = c("dodgerblue4",
                                "black",
                                "firebrick1")) +
  theme_classic() + 
  ggtitle("News/politics only (2015; weighted)") +
  labs(x = "Average media diet slant",
       y = "Density",
       color = "Party",
       fill = "Party") +
  lims(x = c(-1, 1)) + 
  theme(legend.position = "none")

plt.15.np <- yg15 %>% 
  filter(pid3 != "Other/Don't know") %>% 
  ggplot(aes(diet_mean_noportals, fill = pid3, color = pid3, 
             weight = weights/sum(weights))) + 
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.75, y = 0.5, 
           label = "Republicans", size = 3, color = "firebrick1") +
  annotate(geom = "text", x = 0.2, y = 1.75, 
           label = "Independents", size = 3, color = "black") +
  annotate(geom = "text", x = -0.8, y = 1, 
           label = "Democrats", size = 3, color = "dodgerblue4") +
  scale_fill_grey() + 
  scale_color_manual(values = c("dodgerblue4",
                                "black",
                                "firebrick1")) +
  theme_classic() + 
  ggtitle("News/politics only, portals excluded (2015; weighted)") +
  labs(x = "Average media diet slant",
       y = "Density",
       color = "Party",
       fill = "Party") +
  lims(x = c(-1, 1)) + 
  theme(legend.position = "none")

# 2016 distributions (portals and no portals).
plt.16.p <- yg16 %>% 
  filter(pid3 != "Other/Don't know") %>% 
  ggplot(aes(diet_mean_newspol, fill = pid3, color = pid3, 
             weight = weight/sum(weight))) + 
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.75, y = 0.65, 
           label = "Republicans", size = 3, color = "firebrick1") +
  annotate(geom = "text", x = 0.12, y = 1.75, 
           label = "Independents", size = 3, color = "black") +
  annotate(geom = "text", x = -0.8, y = 1, 
           label = "Democrats", size = 3, color = "dodgerblue4") +
  scale_fill_grey() + 
  scale_color_manual(values = c("dodgerblue4",
                                "black",
                                "firebrick1")) +
  theme_classic() + 
  ggtitle("News/politics only (2016; weighted)") +
  labs(x = "Average media diet slant",
       y = "Density",
       color = "Party",
       fill = "Party") +
  lims(x = c(-1, 1)) + 
  theme(legend.position = "none")

plt.16.np <- yg16 %>% 
  filter(pid3 != "Other/Don't know") %>% 
  ggplot(aes(diet_mean_noportals, fill = pid3, color = pid3, 
             weight = weight/sum(weight))) + 
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.75, y = 0.65, 
           label = "Republicans", size = 3, color = "firebrick1") +
  annotate(geom = "text", x = 0.05, y = 1.75, 
           label = "Independents", size = 3, color = "black") +
  annotate(geom = "text", x = -0.8, y = 1, 
           label = "Democrats", size = 3, color = "dodgerblue4") +
  scale_fill_grey() + 
  scale_color_manual(values = c("dodgerblue4",
                                "black",
                                "firebrick1")) +
  theme_classic() + 
  ggtitle("News/politics only, portals excluded (2016; weighted)") +
  labs(x = "Average media diet slant",
       y = "Density",
       color = "Party",
       fill = "Party") +
  lims(x = c(-1, 1)) + 
  theme(legend.position = "none")

# Arrange, save, and output plot. 
plt.1 <- ggarrange(plt.15.p, plt.15.np,
                   plt.16.p, plt.16.np) %>% 
  annotate_figure(top = text_grob("Figure 1. Americans' Online Media Diets by Partisanship",
                  face = "bold",
                  size = 14))
ggsave("output/figures/figure_1.pdf", device = "pdf", width = 10, height = 8)
```
```{r, echo = F, fig.show="hold", out.width="100%", fig.align="center"}
# Load Figure 1 into the RMarkdown document. 
knitr::include_graphics("output/figures/figure_1.pdf")
```
Next, we compute the "overlapping coefficient" between the distributions of average Democrat, Republican, and Independent media diet slants. To do so, we use the same formula as Inman and Bradley Jr. (1989), Clemons and Bradley Jr. (2000), and Guess (2021):

\begin{equation}
1 - \frac {1}{2} \int\limits_{-\infty}^{+\infty} \bigg|f(x) - g(x)\bigg| dx \,\,.
\end{equation}

This simple statistic computes the shared area underneath two PDFs as a fraction of the total area. This is feasible due to the fact that any PDF integrated over its support yields a value of 1. In a sense, it quantifies the extent to which two PDFs "agree" with one another.

As shown in \textbf{Table 1}, in 2015, Democrats and Republican media diet slant distributions exhibited about 60-63% overlap, Democrats and Independents about 77-83%, and Republicans and Independents about 79-80%. In 2016, Democrat and Republican media diet slant distributions exhibited about 43-48% overlap, Democrats and Independents about 68-73%, and Republicans and Independents about 72-74%. $\Delta_P$ and $\Delta_{NP}$ capture the change in overlap coefficients between 2015 and 2016 for URLs that include and exclude portal sites, respectively. Notably, every measure of media diet slant change is negative, meaning that Americans of all party identifications became more extreme in their media diets. The extremification of media diets between 2015 and 2016 represents a finding that was relatively glossed over in Guess (2021).

```{r, echo = FALSE}
# Table 1: Overlap coefficients for each 

# 2015 distributions (portals and no portals).
overlap.15.p.dr  <- overlap(x = yg15 %>% # Overlap between Democrats and Republicans.
                                filter(pid3 == "Democrat") %>% 
                                pull(diet_mean_newspol), 
                            y = yg15 %>% 
                                filter(pid3 == "Republican") %>% 
                                pull(diet_mean_newspol))
overlap.15.p.di <- overlap(x = yg15 %>% # Overlap between Democrats and Independents.
                               filter(pid3 == "Democrat") %>% 
                               pull(diet_mean_newspol), 
                           y = yg15 %>% 
                               filter(pid3 == "Independent") %>% 
                               pull(diet_mean_newspol))
overlap.15.p.ri <- overlap(x = yg15 %>% # Overlap between Republicans and Independents.
                               filter(pid3 == "Republican") %>% 
                               pull(diet_mean_newspol), 
                           y = yg15 %>% 
                               filter(pid3 == "Independent") %>% 
                               pull(diet_mean_newspol))
overlap.15.np.dr  <- overlap(x = yg15 %>% # Overlap between Democrats and Republicans.
                                 filter(pid3 == "Democrat") %>% 
                                 pull(diet_mean_noportals), 
                             y = yg15 %>% 
                                 filter(pid3 == "Republican") %>% 
                                 pull(diet_mean_noportals))
overlap.15.np.di <- overlap(x = yg15 %>% # Overlap between Democrats and Independents.
                                filter(pid3 == "Democrat") %>% 
                                pull(diet_mean_noportals), 
                            y = yg15 %>% 
                                filter(pid3 == "Independent") %>% 
                                pull(diet_mean_noportals))
overlap.15.np.ri <- overlap(x = yg15 %>% # Overlap between Republicans and Independents.
                                filter(pid3 == "Republican") %>% 
                                pull(diet_mean_noportals), 
                            y = yg15 %>% 
                                filter(pid3 == "Independent") %>% 
                                pull(diet_mean_noportals))

# 2016 distributions (portals and noportals).
overlap.16.p.dr  <- overlap(x = yg16 %>% # Overlap between Democrats and Republicans.
                                filter(pid3 == "Democrat") %>% 
                                pull(diet_mean_newspol), 
                            y = yg16 %>% 
                                filter(pid3 == "Republican") %>% 
                                pull(diet_mean_newspol))
overlap.16.p.di <- overlap(x = yg16 %>% # Overlap between Democrats and Independents.
                               filter(pid3 == "Democrat") %>% 
                               pull(diet_mean_newspol), 
                           y = yg16 %>% 
                               filter(pid3 == "Independent") %>% 
                               pull(diet_mean_newspol))
overlap.16.p.ri <- overlap(x = yg16 %>% # Overlap between Republicans and Independents.
                               filter(pid3 == "Republican") %>% 
                               pull(diet_mean_newspol), 
                           y = yg16 %>% 
                               filter(pid3 == "Independent") %>% 
                               pull(diet_mean_newspol))
overlap.16.np.dr  <- overlap(x = yg16 %>% # Overlap between Democrats and Republicans.
                                 filter(pid3 == "Democrat") %>% 
                                 pull(diet_mean_noportals), 
                             y = yg16 %>% 
                                 filter(pid3 == "Republican") %>% 
                                 pull(diet_mean_noportals))
overlap.16.np.di <- overlap(x = yg16 %>% # Overlap between Democrats and Independents.
                                filter(pid3 == "Democrat") %>% 
                                pull(diet_mean_noportals), 
                            y = yg16 %>% 
                                filter(pid3 == "Independent") %>% 
                                pull(diet_mean_noportals))
overlap.16.np.ri <- overlap(x = yg16 %>% # Overlap between Republicans and Independents.
                                filter(pid3 == "Republican") %>% 
                                pull(diet_mean_noportals), 
                            y = yg16 %>% 
                                filter(pid3 == "Independent") %>% 
                                pull(diet_mean_noportals))
oc.matrix <- matrix(data = c(overlap.15.p.dr, overlap.15.np.dr, overlap.16.p.dr, overlap.16.np.dr,
                             overlap.16.p.dr - overlap.15.p.dr, overlap.16.np.dr - overlap.15.np.dr, 
                             overlap.15.p.di, overlap.15.np.di, overlap.16.p.di, overlap.16.np.di,
                             overlap.16.p.di - overlap.15.p.di, overlap.16.np.di - overlap.15.np.di, 
                             overlap.15.p.ri, overlap.15.np.ri, overlap.16.p.ri, overlap.16.np.ri,
                             overlap.16.p.ri - overlap.15.p.ri, overlap.16.np.ri - overlap.15.np.ri) %>% 
                      round(digits = 3),
                    nrow = 3, 
                    ncol = 6,
                    byrow = TRUE,
                    dimnames = list(c("Dem-Rep", "Dem-Ind", "Rep-Ind"),
                                    c("2015", "2015 (no portals)", "2016", "2016 (no portals)", 
                                      "$\\Delta_{P}$", "$\\Delta_{NP}$")))
kable(oc.matrix,
      align = rep('c', 6),
      format = "latex",
      booktabs = T,
      escape = FALSE,
      caption = "\\textbf{Overlap Coefficients for Americans' Partisan Media Diets}") %>% 
      kable_styling(position = "center", 
                    latex_options = "HOLD_position")
```

```{r, echo = FALSE, results = 'asis'}
# Recode the covariates. 
yg15$pid3 <- fct_relevel(yg15$pid3, "Other/Don't know")
yg15$ideo5 <- fct_relevel(yg15$ideo5, "Don't know")
yg15$race <- fct_relevel(yg15$race, "Other")
yg15$gender <- as_factor(yg15$gender)
yg15$educ <- as_factor(yg15$educ)
yg15$educ <- fct_relevel(yg15$educ, "Less than high school")
yg15$educ <- fct_relevel(yg15$educ, levels(yg15$educ)[c(1,4,2,5,3)])
yg16$pid3 <- fct_relevel(yg16$pid3, "Other/Don't know")
yg16$ideo5 <- fct_relevel(yg16$ideo5, "Don't know")
yg16$race <- fct_relevel(yg16$race, "Other")
yg16$gender <- as_factor(yg16$gender)
yg16$gender <- fct_relevel(yg16$gender, "Male")
yg16$educ <- as_factor(yg16$educ)
yg16$educ <- fct_relevel(yg16$educ, "Less than high school")
yg16$educ <- fct_relevel(yg16$educ, levels(yg16$educ)[c(1,4,2,5,3)])

# Run models. 
model.1.2015 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + pid3,
                   data = yg15,
                   weights = weights)
model.1.2016 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + pid3,
                   data = yg16, 
                   weights = weight)

# Sequester variable names.
varnames <- str_replace(str_replace(str_replace(str_replace(str_replace(
  str_replace(attr(summary(model.1.2015)$coefficients, "dimnames")[[1]], 
              "age", "Age: "), "race", "Race: "), "educ", ""), "pid3", ""), 
  "gender", ""), "income", "Income level")
```

## Models of Partisan Media Consumption in Guess (2021) Exhibit Model Dependence

Notably, Guess' (2021) main model of media diet consumption (replicated in Appendix A.1) includes respondents' party identification, but not ideology. We are interested in seeing whether these models exhibit symptoms of model dependence and whether the inclusion of respondent ideology could alleviate the model dependence. 

To do so, we introduce two simple, descriptive "bellwether" statistics for glimpsing hints of model dependence, the mean absolute uncertainty deviation (MAUD) and the standardized mean absolute uncertainty deviation (SMAUD), defined as follows: 

\begin{equation}
\text{MAUD} = \frac{1}{n}\sum\limits_{i=1}^{n}|\sigma_{i_{robust}} - \sigma_{i_{vanilla}}| \,\,,
\end{equation}

\begin{equation}
\text{SMAUD} = \frac{1}{n}\sum\limits_{i=1}^{n}\frac{|\sigma_{i_{robust}} - \sigma_{i_{vanilla}}|}{\sigma_{i_{vanilla}}} \,\,.
\end{equation}

Here, $\sigma_{i_{robust}}$ and $\sigma_{i_{vanilla}}$ are the coefficient standard errors for parametric linear regression models estimated using the heteroskedasticity-robust estimator and classical regression standard errors, respectively. These estimators take form from the assertion in King & Roberts (2015) that "robust and classical standard errors that differ need to be seen as bright red flags that signal compelling evidence of uncorrected model misspecification," (p. 159).\footnote{We have not yet attempted to rigorously prove or disprove any properties of these statistics as they pertain to model dependence, so they should be treated with extreme caution and skepticism.}

We also employ the generalized information matrix (GIM) test proposed by King & Roberts (2015), which uses a parametric boostrap-based information matrix test to diagnose model dependence. As Aronow (2016) notes, King and Roberts (2015) is valid only when the researcher specifies a fully parametric regression model. Guess (2021) does exactly that, specifying the linear regression model with finite-dimensional parameterization for media diet slant, $Y_i$, as follows: 
\begin{equation}
Y_i = \alpha + {\beta_1}{\text{Democrat}_i} + {\beta_2}{\text{Republican}_i} + {\beta_3}{\text{Independent}_i} + \gamma{\boldsymbol{X}_i} + \epsilon_i \,,
\end{equation}
where $\boldsymbol{X}_i$ is a "vector of demographic characteristics that includes age, race, gender, family income, and educational attainment," (p. 1016). 

When MAUD, SMAUD, and GIM test statistics are computed in \textbf{Table 2}, we find clear evidence of model dependence in Guess (2021). In the next section, we attempt to address and alleviate this model dependence using a measure of respondent ideology. 

```{r, echo = FALSE}
# Table 2: MAUDs, SMAUDs, and GIM Test results on original models. 

# Function to compute MAUD and SMAUD. Takes two vectors of coefficient standard 
# errors and a binary variable for standardization and outputs the desired
# summary statistic. 
maud.gen <- function(se.1, se.2, standardize = 0) {
  se.1 <- as.numeric(se.1)
  se.2 <- as.numeric(se.2)
  if (standardize) {
    statistic <- mean(abs(se.1-se.2)/se.1)
  } else {
    statistic <- mean(abs(se.1-se.2))
  }
  return(statistic)
}

# Compute MAUDs and SMAUDs for each model. 
maud.2015.1 <- maud.gen(starprep(model.1.2015)[[1]],
                        summary(model.1.2015)$coefficients[,2])
maud.2016.1 <- maud.gen(starprep(model.1.2016)[[1]],
                        summary(model.1.2016)$coefficients[,2])
smaud.2015.1 <- maud.gen(starprep(model.1.2015)[[1]],
                         summary(model.1.2015)$coefficients[,2], 1)
smaud.2016.1 <- maud.gen(starprep(model.1.2016)[[1]],
                         summary(model.1.2016)$coefficients[,2], 1)

# Transfer models to GLM form for incorporation in GIM Tests.
glm.1.2015 <- glm(diet_mean_newspol ~ age + race + gender + income + educ + pid3,
                  data = yg15,
                  weights = weights)
glm.1.2016 <- glm(diet_mean_newspol ~ age + race + gender + income + educ + pid3,
                  data = yg16, 
                  weights = weight)
```
```{r, eval = FALSE, echo = FALSE}
# Compute generalized information matrix (GIM) tests. 
GIM.1.2015 <- GIM(glm.1.2015, full = TRUE, B = 30, B2 = 25)
GIM.1.2016 <- GIM(glm.1.2016, full = TRUE, B = 30, B2 = 25)
```
```{r, echo = FALSE}
# Load tediously computed GIMs. 
load("input/GIM/GIM_1_2015.RData")
load("input/GIM/GIM_1_2016.RData")

# Function to extract desirable components of GIM object.
GIM.extract <- function(GIM.object) {
  rot <- GIM.object$`Rule of Thumb`
  tstat <- GIM.object$`GIM test statistic`
  pval <- GIM.object$`GIM pval`
  return(c(rot, 
           tstat,
           pval))
}

# Output results in table. 
dep.mat <- matrix(data = c(maud.2015.1, smaud.2015.1, GIM.extract(GIM.1.2015),
                           maud.2016.1, smaud.2016.1, GIM.extract(GIM.1.2016)) %>% 
                             round(digits = 3),
                  nrow = 2, 
                  ncol = 5, 
                  byrow = TRUE,
                  dimnames = list(c("2015", "2016"),
                                  c("MAUD", "SMAUD", "(GIM) Rule of Thumb", "Test Statistic", "\\textit{p}-value")))
kable(dep.mat,
      align = rep('c', 5),
      format = "latex",
      booktabs = T,
      escape = FALSE,
      caption = "\\textbf{Evidence for Model Dependence in Main Consumption Determinants Models}") %>% 
  kable_styling(position = "center", 
                latex_options = "HOLD_position")
```

## Including Respondent Ideology Reduces Model Dependence in Guess (2021)

We extend the Guess (2021) parametric linear model of media diet slant (4) by including respondent ideology as follows: 
\begin{align}
Y_i = \alpha &+ {\beta_1}{\text{Democrat}_i} + {\beta_2}{\text{Republican}_i} + {\beta_3}{\text{Independent}_i} + {\mu_1}{\text{Conservative}_i} \nonumber \\ &+{\mu_2}{\text{Liberal}_i} + {\mu_3}{\text{Moderate}_i} + {\mu_4}{\text{Very Conservative}_i} \\ &+ {\mu_5}{\text{Very Liberal}_i} + \gamma{\boldsymbol{X}_i} + \epsilon_i \nonumber \,.
\end{align}

The results of this extended model can be found in Appendix A.2. When ideology is included in the model, the coefficient significance attributed to party identification variables disappears in favor of the ideology variables.

```{r, echo = FALSE, results = 'asis'}
# Models that account for both partisanship and ideology. 

# Tyler, Grimmer, Iyengar (2016) data acquisition procedure (see code in subsequent 
# parts for detailed derivations).
news <- readRDS("input/TGI_Data/TGI2021NewsMerge.rds")

just_domain <- function(x) suffix_extract(url_parse(x)$domain)$domain

alexa355 <- read.csv("input/TGI_Data/politicaldomains-final-edits.csv") %>%
  select(domain, type)
alexa355$domain <- just_domain(alexa355$domain)
alexa355 <- alexa355 %>% mutate(
  type = as.character(type),
  type = ifelse(type == "aggregator", "portal", type))

gdf <- news %>% select(caseid, avg_align, weight,
  age, race, female, income, ideology, newsint,
  educ, pid)

mean_align <- gdf %>% select(caseid, avg_align) %>%
  group_by(caseid) %>% 
  dplyr::summarize(mean_align = mean(avg_align, na.rm = TRUE))

gdf <- mean_align %>%
  left_join(gdf %>% select(-avg_align) %>% distinct(),
    by = "caseid") %>% 
  distinct()

age_levels <- paste0("Age: ", c("Under 30", "30-44", "45-59", "60+"))
race_levels <- paste0("Race: ", c("Other", "White", "Black", "Hispanic"))
educ_levels <- paste0("Educ.: ", c("Some or no HS", "High school", "Some college", "College graduate", "Postgraduate"))
party_levels <- paste0("Party: ", c("Independent", "Democrat",
  "Republican"))
ideo_levels <- paste0("Ideo.: ", c("Moderate", "Liberal", "Conservative", "Very conservative", "Very liberal"))

inc_levels <- c("$10,000 - $19,999",
  "$20,000 - $29,999", "$50,000 - $59,999", 
  "$70,000 - $79,999",   "$30,000 - $39,999",
  "$120,000 - $149,999",
  "$80,000 - $99,999",   "$150,000 - $199,999", "$40,000 - $49,999",
  "$100,000 - $119,999", "$60,000 - $69,999",   "$150,000 or more",
  "Less than $10,000", "$350,000 - $499,999", "$200,000 - $249,999",
  "$250,000 - $349,999")

inc_levels <- inc_levels[c(13, 1, 2, 5, 9, 3, 11, 4,
  7, 10, 6, 8, 15, 16)]

ideo_levels <- c("Conservative", "Liberal", "Moderate",
  "Very conservative","Very liberal", "Don't know")
ideo_levels <- ideo_levels[c(6, 1, 2, 3, 4, 5)]

ideo5 <- addNA(gdf$ideology)
levels(ideo5) <- c(levels(gdf$ideology), "Don't know")
gdf$ideo5 <- ideo5
gdf <- gdf %>% mutate(
  Age = factor(case_when(
    age < 30 ~ age_levels[1],
    age >= 30 & age <= 45 ~ age_levels[2],
    age >= 45 & age <= 59 ~ age_levels[3],
    age >= 60 ~ age_levels[4]), levels = age_levels),
  Race = factor(case_when(
    race == "White" ~ race_levels[2],
    race == "Black" ~ race_levels[3],
    race == "Hispanic" ~ race_levels[4],
    TRUE ~ race_levels[1]), levels = race_levels),
  Educ = factor(case_when(
    educ == "No HS" ~ educ_levels[1],
    educ == "High school graduate" ~ educ_levels[2],
    educ %in% c("Some college", "2-year") ~ educ_levels[3],
    educ == "4-year" ~ educ_levels[4],
    educ == "Post-grad" ~ educ_levels[5]), levels = educ_levels),
  Party = factor(case_when(
    pid <= 3 ~ party_levels[2],
    pid == 4 ~ party_levels[1],
    pid >= 5 ~ party_levels[3]), levels = party_levels),
  Income = factor(income, levels = inc_levels),
  Income_Numeric = as.integer(Income),
  Ideology = factor(ideo5, levels = ideo_levels),
  Ideology_Numeric = as.integer(Ideology),
  Female = 1 * (female == "Female"),
)

# Estimate different model specifications including . 
model.2.2015 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + ideo5,
                   data = yg15,
                   weights = weights)
model.3.2015 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + pid3 +
                     ideo5,
                   data = yg15,
                   weights = weights)
model.2.2016 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + ideo5,
                   data = yg16, 
                   weights = weight)
model.3.2016 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + pid3 +
                   ideo5,
                   data = yg16, 
                   weights = weight)

varnames <- str_replace(str_replace(str_replace(str_replace(str_replace(str_replace(
  str_replace(attr(summary(model.3.2015)$coefficients, "dimnames")[[1]], 
              "age", "Age: "), "race", "Race: "), "educ", ""), "pid3", ""), 
  "gender", ""), "income", "Income level"), "ideo5", "")

# Output model results in stargazer table. 
# stargazer(model.1.2015,
#           model.2.2015,
#           model.3.2015,
#           model.1.2016,
#           model.2.2016,
#           model.3.2016,
#           style = "apsr", 
#           title = "\\textbf{Determinants of Media Diet Slant (Including Ideology)}",
#           dep.var.labels = "Average media diet slant (news/politics only)", 
#           column.labels = c("\\emph{2015} (PID)", 
#                             "\\emph{2015} (Ideo)",
#                             "\\emph{2015} (Combined)",
#                             "\\emph{2016} (PID)",
#                             "\\emph{2016} (Ideo)",
#                             "\\emph{2016} (Combined)"), 
#           se = starprep(model.1.2015, model.2.2015, model.3.2015,
#                         model.1.2016, model.2.2016, model.3.2016),
#           p = starprep(model.1.2015, model.2.2015, model.3.2015,
#                        model.1.2016, model.2.2016, model.3.2016,
#                        stat = "p.value"),
#           omit = c("Constant",
#                    "age30-44", "age45-59", "age60+", 
#                    "raceBlack", "raceHispanic", "raceWhite", 
#                    "genderFemale", "income", "educHigh school",
#                    "educSome college", "educCollege graduate", 
#                    "educPostgraduate"),
#           covariate.labels = varnames[14:length(varnames)],
#           omit.stat = c("f", "ser", "rsq"), column.sep.width = "0pt",
#           star.cutoffs = c(.05, .01),
#           align = TRUE,
#           notes = c("OLS regressions with HC2 robust standard errors in parentheses; YouGov survey data with weights applied."),
#           add.lines = list("",
#                            c("Party ID", "\\checkmark", "", "\\checkmark", 
#                              "\\checkmark", "", "\\checkmark"),
#                            c("Ideology", "", "\\checkmark", "\\checkmark", 
#                              "", "\\checkmark", "\\checkmark")),
#           notes.append = TRUE,
#           header = FALSE,
#           float = TRUE,
#           table.placement = "H",
#           font.size = "scriptsize")
```

But what of model dependence? \textbf{Figure 2} shows that including ideology in models of media diet slant decrease model dependence across all three of the previously outlined statistics for detecting model dependence.\footnote{One exception is the GIM Test statistic in 2016. Perhaps the 2016 election made some partisans more ideological in orientation?}

```{r, echo = FALSE}
# Compute MAUDs and SMAUDs for various model types. 
maud.2015.2 <- maud.gen(starprep(model.2.2015)[[1]], 
                        summary(model.2.2015)$coefficients[,2])
maud.2015.3 <- maud.gen(starprep(model.3.2015)[[1]],
                        summary(model.3.2015)$coefficients[,2])
maud.2016.2 <- maud.gen(starprep(model.2.2016)[[1]], 
                        summary(model.2.2016)$coefficients[,2])
maud.2016.3 <- maud.gen(starprep(model.3.2016)[[1]],
                        summary(model.3.2016)$coefficients[,2])
smaud.2015.2 <- maud.gen(starprep(model.2.2015)[[1]], 
                         summary(model.2.2015)$coefficients[,2], 1)
smaud.2015.3 <- maud.gen(starprep(model.3.2015)[[1]],
                         summary(model.3.2015)$coefficients[,2], 1)
smaud.2016.2 <- maud.gen(starprep(model.2.2016)[[1]], 
                         summary(model.2.2016)$coefficients[,2], 1)
smaud.2016.3 <- maud.gen(starprep(model.3.2016)[[1]],
                         summary(model.3.2016)$coefficients[,2], 1)
mauds <- c(maud.2015.1, maud.2015.2, maud.2015.3,
           maud.2016.1, maud.2016.2, maud.2016.3)
smauds <- c(smaud.2015.1, smaud.2015.2, smaud.2015.3,
            smaud.2016.1, smaud.2016.2, smaud.2016.3)

# Estimate GLM versions of models for inclusion in the GIM test functions. 
glm.2.2015 <- glm(diet_mean_newspol ~ age + race + gender + income + educ + ideo5,
                  data = yg15,
                  weights = weights)
glm.3.2015 <- glm(diet_mean_newspol ~ age + race + gender + income + educ + pid3 +
                     ideo5,
                  data = yg15,
                  weights = weights)
glm.2.2016 <- glm(diet_mean_newspol ~ age + race + gender + income + educ + ideo5,
                  data = yg16, 
                  weights = weight)
glm.3.2016 <- glm(diet_mean_newspol ~ age + race + gender + income + educ + pid3 +
                  ideo5,
                  data = yg16, 
                  weights = weight)
```

```{r, echo = FALSE, eval=FALSE}
# Compute GIM tests for above models. 
GIM.2.2015 <- GIM(glm.2.2015, full = TRUE, B = 30, B2 = 25)
GIM.3.2015 <- GIM(glm.3.2015, full = TRUE, B = 30, B2 = 25)
GIM.2.2016 <- GIM(glm.2.2016, full = TRUE, B = 30, B2 = 25)
GIM.3.2016 <- GIM(glm.3.2016, full = TRUE, B = 30, B2 = 25)
```

```{r, echo = FALSE}
# Figure 2. Decreasing Model Dependence Through Incorporation of Respondent Ideology

# Load tediously computed GIM test objects.  
load("input/GIM/GIM_2_2015.RData")
load("input/GIM/GIM_3_2015.RData")
load("input/GIM/GIM_2_2016.RData")
load("input/GIM/GIM_3_2016.RData")

# MAUD/SMAUD subfigures. 
unc.dat <- data.frame(maud = mauds, 
                     smaud = smauds, 
                     rot = c(GIM.1.2015$`Rule of Thumb`,
                             GIM.2.2015$`Rule of Thumb`,
                             GIM.3.2015$`Rule of Thumb`,
                             GIM.1.2016$`Rule of Thumb`,
                             GIM.2.2016$`Rule of Thumb`,
                             GIM.3.2016$`Rule of Thumb`),
                     model = rep(c("PID", "Ideo", "Combo"), 2),
                     year = c(rep(2015, 3),
                              rep(2016, 3)))

m.plt <- unc.dat %>% 
  ggplot(aes(x = maud, y = model, label = model)) +
  geom_point(size = 3, shape = 24, fill = "blue", color = "darkred") +
  geom_vline(xintercept = 0, lty = "dotted") + 
  facet_grid(~year) +
  labs(x = "MAUD",
       y = "Model", 
       title = "MAUDs by Model Specification") + 
  theme_pubr()

sm.plt <- unc.dat %>% 
  ggplot(aes(x = smaud, y = model, label = model)) +
  geom_point(size = 3, shape = 23, fill = "blue", color = "darkred") +
  geom_vline(xintercept = 0, lty = "dotted") + 
  facet_grid(~year) +
  labs(x = "SMAUD",
       y = "Model", 
       title = "SMAUDs by Model Specification") + 
  theme_pubr()
  
# GIM statistics subfigures. 
rot.plt <- unc.dat %>% 
  ggplot(aes(x = rot, y = model, label = model)) +
  geom_point(size = 3, shape = 22, fill = "blue", color = "darkred") +
  geom_vline(xintercept = 0, lty = "dotted") + 
  facet_grid(~year) +
  labs(x = "(GIM) Test Statistics",
       y = "Model", 
       title = "(GIM) Test Statistics by Model Specification") + 
  theme_pubr()

# Arrange, save, and output plot. 
plt.2 <- ggarrange(m.plt,
                   sm.plt,
                   rot.plt,
                   nrow = 3) %>% 
  annotate_figure(top = text_grob("Figure 2. Inclusion of Ideology Reduces Model Dependence",
                  face = "bold",
                  size = 14))
ggsave("output/figures/figure_2.pdf", device = "pdf", width = 8, height = 10)
```
```{r, echo = F, fig.show="hold", out.width="90%", fig.align="center"}
# Load Figure 1 into the RMarkdown document. 
knitr::include_graphics("output/figures/figure_2.pdf")
```

# Evidence of Ideological Extremification in the American Media Diet

```{r, echo = FALSE, warning = FALSE, message = FALSE}
# Figure 3. Americans' Media Diets by Ideology.

# Mutate data.
yg15 <- yg15 %>% # 2015
  mutate(pol.class = case_when(pid3 == "Republican" & ideo5 == "Very conservative" ~ "Rep Ideologue", 
 pid3 == "Democrat" & ideo5 == "Very liberal" ~ "Dem Ideologue",
 pid3 == "Republican" & ideo5 == "Conservative" ~ "Rep Near Ideologue",
 pid3 == "Democrat" & ideo5 == "Liberal" ~ "Dem Near Ideologue",
 (ideo5 == "Moderate" | ideo5 == "Don't know") & (pid3 == "Democrat" | pid3 == "Republican") ~ "Moderate Partisan", 
 pid3 == "Independent" & ideo5 == "Moderate" ~ "True Moderate",
 TRUE ~ "No issue content")) %>% 
  mutate(pol.class = factor(pol.class))
yg15$pol.class <- fct_relevel(yg15$pol.class, "No issue content")
yg16 <- yg16 %>% # 2016
  mutate(pol.class = case_when(pid3 == "Republican" & ideo5 == "Very conservative" ~ "Rep Ideologue", 
 pid3 == "Democrat" & ideo5 == "Very liberal" ~ "Dem Ideologue",
 pid3 == "Republican" & ideo5 == "Conservative" ~ "Rep Near Ideologue",
 pid3 == "Democrat" & ideo5 == "Liberal" ~ "Dem Near Ideologue",
 (ideo5 == "Moderate" | ideo5 == "Don't know") & (pid3 == "Democrat" | pid3 == "Republican") ~ "Moderate Partisan", 
 pid3 == "Independent" & ideo5 == "Moderate" ~ "True Moderate",
 TRUE ~ "No issue content")) %>% 
  mutate(pol.class = factor(pol.class))
yg16$pol.class <- fct_relevel(yg16$pol.class, "No issue content")

# Compute ideological overlap coefficients.
overlap.1 <- overlap(x = yg15 %>% # Overlap between Democrats and Republicans.
                          filter(pol.class == "Dem Ideologue") %>%
                          pull(diet_mean_newspol),
                      y = yg15 %>%
                          filter(pol.class == "Rep Ideologue") %>%
                          pull(diet_mean_newspol)) %>% 
             round(digits = 2)
overlap.2 <- overlap(x = yg15 %>% # Overlap between Democrats and Republicans.
                          filter(pol.class == "Dem Near Ideologue") %>%
                          pull(diet_mean_newspol),
                      y = yg15 %>%
                          filter(pol.class == "Rep Near Ideologue") %>%
                          pull(diet_mean_newspol)) %>% 
             round(digits = 2)
overlap.3 <- overlap(x = yg16 %>% # Overlap between Democrats and Republicans.
                          filter(pol.class == "Dem Ideologue") %>%
                          pull(diet_mean_newspol),
                      y = yg16 %>%
                          filter(pol.class == "Rep Ideologue") %>%
                          pull(diet_mean_newspol)) %>% 
             round(digits = 2)
overlap.4 <- overlap(x = yg16 %>% # Overlap between Democrats and Republicans.
                          filter(pol.class == "Dem Near Ideologue") %>%
                          pull(diet_mean_newspol),
                      y = yg16 %>%
                          filter(pol.class == "Rep Near Ideologue") %>%
                          pull(diet_mean_newspol)) %>% 
             round(digits = 2)

# Generate plots.
plt.15.p.i <- yg15 %>%
  filter(pol.class == "Rep Ideologue" | pol.class == "Dem Ideologue") %>% 
  ggplot(aes(diet_mean_newspol, fill = pol.class, color = pol.class,
             weight = weights/sum(weights))) +
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.6, y = 0.75,
           label = "Republicans", size = 3, color = "red") +
  annotate(geom = "text", x = -0.8, y = 1.5,
           label = "Democrats", size = 3, color = "blue") +
  annotate(geom = "text", x = 0.5, y = 1.5,
           label = paste("Overlap: ", overlap.1), size = 3) +
  scale_fill_grey() +
  scale_color_manual(values = c("blue",
                                "red")) +
  theme_classic() +
  ggtitle("Ideologues (2015; weighted)") +
  labs(x = "Average media diet slant",
       y = "Density",
       color = "Ideological Class",
       fill = "Ideological Class") +
  lims(x = c(-1, 1)) +
  theme(legend.position = "none")

plt.15.p.ni <- yg15 %>%
  filter(pol.class == "Rep Near Ideologue" | pol.class == "Dem Near Ideologue") %>% 
  ggplot(aes(diet_mean_newspol, fill = pol.class, color = pol.class,
             weight = weights/sum(weights))) +
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.6, y = 0.7,
           label = "Republicans", size = 3, color = "firebrick1") +
  annotate(geom = "text", x = -0.775, y = 1.3,
           label = "Democrats", size = 3, color = "dodgerblue4") +
  annotate(geom = "text", x = 0.5, y = 1.5,
           label = paste("Overlap: ", overlap.2), size = 3) +
  scale_fill_grey() +
  scale_color_manual(values = c("dodgerblue4",
                                "firebrick1")) +
  theme_classic() +
  ggtitle("Near Ideologues (2015; weighted)") +
  labs(x = "Average media diet slant",
       y = "Density",
       color = "Ideological Class",
       fill = "Ideological Class") +
  lims(x = c(-1, 1)) +
  theme(legend.position = "none")

plt.15.p.m <- yg15 %>%
  filter(pol.class == "Moderate Partisan" | pol.class == "True Moderate" | pol.class == "No issue content") %>% 
  ggplot(aes(diet_mean_newspol, fill = pol.class, color = pol.class,
             weight = weights/sum(weights))) +
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.35, y = 1.90,
           label = "Moderate Partisans", size = 3, color = "forestgreen") +
  annotate(geom = "text", x = -0.75, y = 1.5,
           label = "True Moderates", size = 3, color = "darkorchid4") +
  annotate(geom = "text", x = 0.75, y = 0.5,
           label = "No Issue Content", size = 3, color = "black") +
  scale_fill_grey() +
  scale_color_manual(values = c("black", 
                                "forestgreen",
                                "darkorchid4")) +
  theme_classic() +
  ggtitle("Moderates (2015; weighted)") +
  labs(x = "Average media diet slant",
       y = "Density",
       color = "Ideological Class",
       fill = "Ideological Class") +
  lims(x = c(-1, 1)) +
  theme(legend.position = "none")

plt.16.p.i <- yg16 %>%
  filter(pol.class == "Rep Ideologue" | pol.class == "Dem Ideologue") %>% 
  ggplot(aes(diet_mean_newspol, fill = pol.class, color = pol.class,
             weight = weight/sum(weight))) +
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.6, y = 1.05,
           label = "Republicans", size = 3, color = "red") +
  annotate(geom = "text", x = -0.65, y = 2,
           label = "Democrats", size = 3, color = "blue") +
  annotate(geom = "text", x = 0.5, y = 1.5,
           label = paste("Overlap: ", overlap.3), size = 3) +
  scale_fill_grey() +
  scale_color_manual(values = c("blue",
                                "red")) +
  theme_classic() +
  ggtitle("Ideologues (2016; weighted)") +
  labs(x = "Average media diet slant",
       y = "Density",
       color = "Ideological Class",
       fill = "Ideological Class") +
  lims(x = c(-1, 1)) +
  theme(legend.position = "none")

plt.16.p.ni <- yg16 %>%
  filter(pol.class == "Rep Near Ideologue" | pol.class == "Dem Near Ideologue") %>% 
  ggplot(aes(diet_mean_newspol, fill = pol.class, color = pol.class,
             weight = weight/sum(weight))) +
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.6, y = 0.7,
           label = "Republicans", size = 3, color = "firebrick1") +
  annotate(geom = "text", x = -0.7, y = 1.5,
           label = "Democrats", size = 3, color = "dodgerblue4") +
  annotate(geom = "text", x = 0.5, y = 1.5,
           label = paste("Overlap: ", overlap.4), size = 3) +
  scale_fill_grey() +
  scale_color_manual(values = c("dodgerblue4",
                                "firebrick1")) +
  theme_classic() +
  ggtitle("Near Ideologues (2016; weighted)") +
  labs(x = "Average media diet slant",
       y = "Density",
       color = "Ideological Class",
       fill = "Ideological Class") +
  lims(x = c(-1, 1)) +
  theme(legend.position = "none")

plt.16.p.m <- yg16 %>%
  filter(pol.class == "Moderate Partisan" | pol.class == "True Moderate" | pol.class == "No issue content") %>% 
  ggplot(aes(diet_mean_newspol, fill = pol.class, color = pol.class,
             weight = weight/sum(weight))) +
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.4, y = 1.90,
           label = "Moderate Partisans", size = 3, color = "forestgreen") +
  annotate(geom = "text", x = -0.65, y = 1.5,
           label = "True Moderates", size = 3, color = "darkorchid4") +
  annotate(geom = "text", x = 0.75, y = 0.5,
           label = "No Issue Content", size = 3, color = "black") +
  scale_fill_grey() +
  scale_color_manual(values = c("black", 
                                "forestgreen",
                                "darkorchid4")) +
  theme_classic() +
  ggtitle("Moderates (2016; weighted)") +
  labs(x = "Average media diet slant",
       y = "Density",
       color = "Ideological Class",
       fill = "Ideological Class") +
  lims(x = c(-1, 1)) +
  theme(legend.position = "none")

# Arrange and save the ideological class plots. 
plt.ideotypes <- ggarrange(plt.15.p.i, plt.16.p.i,
                           plt.15.p.ni, plt.16.p.ni,
                           plt.15.p.m, plt.16.p.m,
                           nrow = 3,
                           ncol = 2) %>% 
  annotate_figure(top = text_grob("Figure 3. Americans' Media Diets by Ideological Type",
                  face = "bold",
                  size = 14))
ggsave("output/figures/figure_3.pdf", device = "pdf", width = 8, height = 11)

# Compute cross-tabs of respondent types. 
pc.2015 <- (sort(prop.table(table(yg15$pol.class))) * 100) %>% round(digits = 3)
pc.2016 <- (sort(prop.table(table(yg16$pol.class))) * 100) %>% round(digits = 3)
```

Following the results providing initial evidence for the influence of ideology instead of or in addition to party identification on Americans' media consumption, we investigate how the interpretation of Guess' (2021) findings change if respondents are stratified by ideology rather than partisanship. 

To do so, we follow ideological strata outlined by Converse (1964) in his seminal work, "The Nature of Belief Systems in Mass Publics."\footnote{ Converse found that individuals could be categorized as ideologues (those who use ideology in their evaluations of policies and politicians), near ideologues (those who mention items that fall along the liberal-conservative dimension, but place less emphasis on them), group interest voters (those who vote with parties as social identity), nature of the times voters (those who made decisions based on whatever social or economic conditions like the Korean War were ongoing at the time), and no issue content voters (those whose evaluations had no shred of coherence or policy relevance at all).} We incorporate the spirit of Converse (1964) in dividing respondents into "ideologues" (Republicans who identify as "very conservative" and Democrats who identify as "very liberal"), "near ideologues" (Republicans and Democrats who are respectively "conservative" and "liberal"), "moderate partisans" (Republicans and Democrats who are ideologically "moderate"), "true moderates" (political Independents who are "moderate"), and "no issue content" (people who answered "Don't know" or who refused to respond to questions). When tabulated (Appendix A.3), true moderates, moderate partisans, and no issue content voters vastly outweigh near ideologues and ideologues.

When plotted in \textbf{Figure 3}, we notice that respondents who are more ideological in nature consume much more extreme online media along the left-right ideological continuum, with each subclass becoming more extreme in 2016 than 2015.\footnote{Ideologues exhibit overlap coefficients between 0.27-0.39 and near ideologues between 0.45-0.64.} Guess (2021) may have generated results suggesting moderate media diet slants due to the outsize presence of moderates in the survey sample. Indeed, we find that ideology and partisanship have correlations of 0.48 in 2015 and 0.57 in 2016 using the Guess (2021) data.

```{r, echo = F, fig.show="hold", out.width="100%", fig.align="center"}
# Output above plot. 
knitr::include_graphics("output/figures/figure_3.pdf")
```

## Using URL Section Name Proxies Produces Different Results Than Logistic Classification for News/Not News

```{r, echo = FALSE, eval = FALSE}
# Implement  Tyler, Grimmer, & Iyenger's (2021) 2016 URL name proxy procedure.
urls <- read_csv("input/TGI_Data/news_domain_urls.csv") # Data from Peterson & Iyengar (2019).

urls <- urls %>% 
  select(-used_at) %>% 
  na.omit()
urls <- urls %>% 
  mutate(type = ifelse(type == "agregator", "portal", type))
urls %>% 
  pull(caseid) %>% 
  n_distinct()

parsed <- urls %>% 
  pull(url) %>% 
  urltools::url_parse() %>% 
  tibble()
suffix <- parsed %>% 
  pull(domain) %>% 
  urltools::suffix_extract() %>% 
  tibble()
urls <- bind_cols(urls %>% select(-domain, -path),
  parsed %>% 
  select(path), suffix) %>%
  select(-suffix, -subdomain, -host) %>% 
  mutate(path = if_else(is.na(path), "", path)) %>% 
  filter(!is.na(domain))

political_keywords <- read_csv("input/TGI_Data/PoliticalKeyWords.csv",
  col_names = FALSE)
political_keywords <- political_keywords %>%
  pull(X1) %>% str_c(collapse = "|")

urls <- urls %>% 
  mutate(political = 1 * str_detect(path, political_keywords),
  political = ifelse(is.na(political), 0, political))

urls <- urls %>% mutate(portal = 1 * (type == "portal"))

set.seed(pi + 4)

get_proper_domain <- function(x) {
  urltools::suffix_extract(urltools::url_parse(x)$domain)$domain
}

# Merge with survey data.
news <- urls

survey <- read_sav("input/TGI_Data/STAN0089_OUTPUT_all.sav")

survey <- survey %>%
  mutate(
    internet_primary = case_when(
    main_news_source_2_pre == 1 ~ 1,
    main_news_source_3_pre == 1 ~ 1,
    main_news_source_4_pre == 1 ~ 1,
    main_news_source_5_pre == 1 ~ 1,
    main_news_source_6_pre == 1 ~ 1,
    main_news_source_7_pre == 1 ~ 1,
    TRUE ~ 0)
    )

survey <- survey %>%
  mutate(
  caseid = as.integer(caseid),
  pid = pid7_pre %>% zap_labels() %>% na_if(8),
  party = case_when(
    pid %in% 1:3 ~ "dem",
    pid %in% 5:7 ~ "rep",
    TRUE ~ "ind") %>% as_factor(),
  income = faminc_pre %>% na_if(97),
  female = gender_pre,
  age = (2016 - birthyr_pre) %>% as.integer(), 
  educ = educ_pre,
  ideology = ideo5_pre %>% na_if(6),
  clinton = case_when(
    as_factor(presvote16_post) == "Hillary Clinton" ~ 1,
    TRUE ~ 0),
  trump = case_when(
    as_factor(presvote16_post) == "Donald Trump" ~ 1,
    TRUE ~ 0),
  voted_major = clinton + trump,
  abortion = Q26_pre,
  refugees = Q27_pre,
  primary = case_when(
    Q2_pre == 1 ~ 1,
    TRUE ~ 0),
  newsint = 1 * (Q31_pre == 1),
  race = case_when(
    race_pre == 1 ~ "White",
    race_pre == 2 ~ "Black",
    race_pre == 3 ~ "Hispanic",
    race_pre == 4 ~ "Asian",
    TRUE ~ "Other"
    ) %>% as_factor(),
  weight = weight_pulse %>% as.numeric(),
  ) %>% 
  select(caseid, pid, party, income, female, age, educ,
    ideology, clinton, trump, voted_major, refugees, abortion,
    primary, newsint, race, weight) %>% 
  as_factor() %>% 
  mutate_if(is.factor,  fct_drop)

news <- news %>% 
  left_join(survey, by = "caseid")

foo <- news %>% group_by(caseid) %>%
  summarise(pfrac = mean(portal)) %>% 
  left_join(survey, by = "caseid")

foo <- foo %>% group_by(pid) %>%
  summarise(m = mean(pfrac),
    se = sd(pfrac) / sqrt(n())) %>% 
  mutate(lwr = m - 2 * se, upr = m + 2 * se)

pid_levels <- c("Strong DEM", "Weak DEM", "Lean DEM",
  "Pure IND", "Lean REP", "Weak REP", "Strong REP")

foo <- foo %>% mutate(pid = factor(case_when(
  pid == 1 ~ "Strong DEM",
  pid == 2 ~ "Weak DEM",
  pid == 3 ~ "Lean DEM",
  pid == 4 ~ "Pure IND",
  pid == 5 ~ "Lean REP",
  pid == 6 ~ "Weak REP",
  pid == 7 ~ "Strong REP"),
  levels = pid_levels)) %>% 
  filter(!is.na(pid))

bakshy <- read_csv("input/TGI_Data/top500.csv") %>%
  select(domain, avg_align)
bakshy <- bakshy %>% mutate(
  domain = str_remove(domain, "www.")) %>% 
  distinct(domain, .keep_all = TRUE)

bakshy_domains <- bakshy$domain
proper_bakshy_domains <- bakshy$domain
bakshy_align <- bakshy$avg_align

find_bakshy_align <- function(url_list) {
  found <- map(url_list,
    ~ bakshy_domains[str_detect(., bakshy_domains)])
  for (i in 1:length(found)) {
    x <- found[[i]]
    actual_proper <- get_proper_domain(url_list[i])
    potential_proper <- get_proper_domain(x)
    found[[i]] <- x[potential_proper == actual_proper]
  }
  
  found <- map(found,
    ~ ifelse(identical(., character(0)), NA, .))
  found <- map_chr(found, ~
    ifelse(is.na(.), ., .[which.max(nchar(.))])) %>% 
    tibble(domain = .)
  found <- left_join(found, bakshy, by = "domain") %>%
    rename(bakshy_domain = domain)
  return(found)
}
```

```{r, echo = FALSE, eval = FALSE}
# Compute partisan alignments via URL section name proxies. 
# Warning: very computationally intensive (recommend running on AWS or just use
# the output I have already saved.)
news <- news %>% pull(url) %>% find_bakshy_align() %>%
  bind_cols(news, .)

news <- news %>% rename(avg_align = b_align)
saveRDS(news, file = "input/TGI_Data/TGI2021NewsMerge.rds")
```

```{r, echo = FALSE}
news <- readRDS("input/TGI_Data/TGI2021NewsMerge.rds")

just_domain <- function(x) suffix_extract(url_parse(x)$domain)$domain

alexa355 <- read.csv("input/TGI_Data/politicaldomains-final-edits.csv") %>%
  select(domain, type)
alexa355$domain <- just_domain(alexa355$domain)
alexa355 <- alexa355 %>% mutate(
  type = as.character(type),
  type = ifelse(type == "aggregator", "portal", type))

gdf <- news %>% select(caseid, avg_align, weight,
  age, race, female, income, ideology, newsint,
  educ, pid)

mean_align <- gdf %>% select(caseid, avg_align) %>%
  group_by(caseid) %>% 
  dplyr::summarize(mean_align = mean(avg_align, na.rm = TRUE))

gdf <- mean_align %>%
  left_join(gdf %>% select(-avg_align) %>%distinct(),
    by = "caseid") %>% 
  distinct()

age_levels <- paste0("Age: ", c("Under 30", "30-44", "45-59", "60+"))
race_levels <- paste0("Race: ", c("Other", "White", "Black", "Hispanic"))
educ_levels <- paste0("Educ.: ", c("Some or no HS", "High school", "Some college", "College graduate", "Postgraduate"))
party_levels <- paste0("Party: ", c("Independent", "Democrat",
  "Republican"))
ideo_levels <- paste0("Ideo.: ", c("Moderate", "Liberal", "Conservative", "Very conservative", "Very liberal"))

inc_levels <- c("$10,000 - $19,999",
  "$20,000 - $29,999", "$50,000 - $59,999", 
  "$70,000 - $79,999",   "$30,000 - $39,999",
  "$120,000 - $149,999",
  "$80,000 - $99,999",   "$150,000 - $199,999", "$40,000 - $49,999",
  "$100,000 - $119,999", "$60,000 - $69,999",   "$150,000 or more",
  "Less than $10,000", "$350,000 - $499,999", "$200,000 - $249,999",
  "$250,000 - $349,999")

inc_levels <- inc_levels[c(13, 1, 2, 5, 9, 3, 11, 4,
  7, 10, 6, 8, 15, 16)]

ideo_levels <- c("Conservative", "Liberal", "Moderate",
  "Very conservative","Very liberal", "Don't know")
ideo_levels <- ideo_levels[c(6, 1, 2, 3, 4, 5)]

ideo5 <- addNA(gdf$ideology)
levels(ideo5) <- c(levels(gdf$ideology), "Don't know")
gdf$ideo5 <- ideo5
gdf <- gdf %>% mutate(
  Age = factor(case_when(
    age < 30 ~ age_levels[1],
    age >= 30 & age <= 45 ~ age_levels[2],
    age >= 45 & age <= 59 ~ age_levels[3],
    age >= 60 ~ age_levels[4]), levels = age_levels),
  Race = factor(case_when(
    race == "White" ~ race_levels[2],
    race == "Black" ~ race_levels[3],
    race == "Hispanic" ~ race_levels[4],
    TRUE ~ race_levels[1]), levels = race_levels),
  Educ = factor(case_when(
    educ == "No HS" ~ educ_levels[1],
    educ == "High school graduate" ~ educ_levels[2],
    educ %in% c("Some college", "2-year") ~ educ_levels[3],
    educ == "4-year" ~ educ_levels[4],
    educ == "Post-grad" ~ educ_levels[5]), levels = educ_levels),
  Party = factor(case_when(
    pid <= 3 ~ party_levels[2],
    pid == 4 ~ party_levels[1],
    pid >= 5 ~ party_levels[3]), levels = party_levels),
  Income = factor(income, levels = inc_levels),
  Income_Numeric = as.integer(Income),
  Ideology = factor(ideo5, levels = ideo_levels),
  Ideology_Numeric = as.integer(Ideology),
  Female = 1 * (female == "Female"),
)

# Tyler, Grimmer, and Iyengar (2021) methodology applied to Guess data. 
gnews <- readRDS("input/TGI_Data/GG2021NewsMerge.rds") # Guess (2021) data is confidential and the raw URL procedure has been omitted for data privacy reasons. 

mean_align <- gnews %>% select(id, avg_align) %>%
  group_by(id) %>% 
  dplyr::summarize(mean_align = mean(avg_align, na.rm = TRUE))

gnews <- mean_align %>%
  left_join(gnews %>% select(-avg_align) %>% distinct(),
    by = "id") %>% 
  distinct()

# Tyler, Grimmer, and Iyengar (2021) expanded with more URL section name proxies
# and applied to Guess (2021) data.
hnews <- readRDS("input/TGI_Data/HG2021NewsMerge.rds")

mean_align <- hnews %>% 
  select(id, avg_align) %>%
  group_by(id) %>% 
  dplyr::summarize(mean_align = mean(avg_align, na.rm = TRUE))

hnews <- mean_align %>%
  left_join(hnews %>% select(-avg_align) %>% distinct(),
    by = "id") %>% 
  distinct()
```

```{r, echo = FALSE, warning = FALSE, message = FALSE}
# Figure 4. Americans' Online Media Diets (URL Section Proxy Method; Guess 2015)

# Compute plots and  overlap coefficients. 
# TGI 2016. 
tgi.15 <- hnews

rep <- tgi.15 %>% filter(pid3 == "Republican") %>% pull(mean_align)
dem <- tgi.15 %>% filter(pid3 == "Democrat") %>% pull(mean_align)
ind <- tgi.15 %>% filter(pid3 == "Independent") %>% pull(mean_align)
tgi.15.dr <- overlap(dem, rep)
tgi.15.di <- overlap(dem, ind)
tgi.15.ri <- overlap(rep, ind)

plt.tgi.15 <- tgi.15 %>% 
  filter(!is.na(pid3)) %>% 
  filter(pid3 != "Other/Don't know") %>% 
  ggplot(aes(mean_align, fill = pid3, color = pid3, weight = 
               weights/sum(weights))) + 
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.75, y = 0.5, 
           label = "Independents", size = 3, color = "black") +
  annotate(geom = "text", x = 0.25, y = 1.75, 
           label = "Republicans", size = 3, color = "firebrick1") +
  annotate(geom = "text", x = -0.8, y = 1, 
           label = "Democrats", size = 3, color = "dodgerblue4") +
  scale_fill_grey() + 
  scale_color_manual(values = c("dodgerblue4",
                                "black",
                                "firebrick1")) +
  theme_classic() + 
  ggtitle("Figure 4. Americans' Online Media Diets (URL Proxy Method)") +
  labs(x = "Average media diet slant",
       y = "Density") +
  lims(x = c(-1, 1)) +
  theme(legend.position = "none",
        plot.title = element_text(size = 14, face = "bold"))
ggsave("output/figures/figure_4.pdf", device = "pdf", width = 6, height = 4)
```

Another study on American media diets, Tyler, Grimmer, & Iyengar (2021), differs from Guess (2021) in finding that Republicans and Democrats are not consuming the same types of online media and that, rather, they are consuming mostly media from right and left-leaning news sources, respectively.\footnote{We replicate the main findings and test model dependence in Tyler, Grimmer, \& Iyengar (2021) in Appendix A.4.} However, Tyler, Grimmer, & Iyengar (2021) employ a different URL news/not news classification method that Guess (2021) based on URL section name proxies. We outline this procedure in the introduction. 

\textbf{Figure 4} shows the results of applying the URL section name proxy classification method used in Tyler, Grimmer, & Iyengar (2021) to the 2015 web browsing data from Guess (2021) in order to test whether the divergent classification strategies employed in each paper produce comparable results. Here, we see that most Americans have even more overlap in their media consumption patterns and the probability densities appear to be much noisier. In the next section, we investigate whether different classification methods affect variable importance in models of American's media diet slants. 

```{r, echo = F, fig.show="hold", out.width="75%", fig.align="center"}
# Load Figure 4 into the RMarkdown document. 
knitr::include_graphics("output/figures/figure_4.pdf")
```

## Let's Get LOCO: What's Driving These Disparities in Results? 

Are the differences between Guess (2021) and Tyler, Grimmer, & Iyengar (2021) due to differential URL news/not news classification methods, underlying differences in survey samples, or something else? Firstly, the correlation between respondents' party identification and ideology in the 2015 data is 0.48 and in 2016 is 0.57 in Guess (2021), whereas the correlation in Tyler, Grimmer, & Iyengar (2021) is much higher at 0.83, suggesting that differences in each paper's survey samples are partially driving disparate results and conclusions in the papers. 

Next, we apply three machine learning algorithms, LASSO, stepwise regression, and random forests within Lei et al.'s (2018) leave-one-covariate-out (LOCO) inference framework\footnote{We explain these methodologies in greater detail in Appendix A.6.} to three datasets and models of American media consumption coming from (1) the Guess (2021) data with the logistic regression classification of URLs as news/not news [\textbf{Figure 5}], (2) the Tyler, Grimmer, & Iyengar (2021) data with the URL section name proxy classification method [\textbf{Figure 6}], and (3) the Guess (2021) data with the URL section name proxy classification method [\textbf{Figure 7}]. The red lines in each figure represent a 100% marginal confidence interval on the prediction error that occurs in each model if the specific covariate on the $x$-axis is removed from the model. In a sense, this is a unique way to get at the importance of each variable in a predictive framework. 

\textbf{Figures 5-7} display that, in the first dataset, ideology and partisan identification appear to be the most important variables in the prediction model. Partisanship and age are the most important variables in the second prediction model. And ideology, education, gender, and partisanship are the most important variables in the third dataset and model. Ultimately, these results support our intuition that both the different URL classification methods and underlying asymmetries in the survey sample in Guess (2021) and Tyler, Grimmer, & Iyengar (2021) are driving the disparate results in each paper.

```{r, echo=FALSE, waring = FALSE, message = FALSE}
# Figures 6/7: LOCO Analysis of Variable Importance in Guess (2021) and 
# Tyler, Grimmer, and Iyengar (2021).
# Sequester and reshape Guess (2021) data for LOCO prediction models. 
oldw <- getOption("warn") # Suppress annoying warnings from conformalInference R package. Be careful when running this code! It changes your global warnings settings to turn off all warnings. Make sure to run the bottom of this chunk or execute "options(warn = oldw)" manually in order to return your options to your original settings. 
options(warn = -1)

yg16.loco <- yg16 %>% 
  select(diet_mean_newspol,
         ideo5, 
         educ,
         gender,
         race,
         age,
         income,
         pid3,
         pol.class) %>% 
  drop_na() 

# Convert factors to integers.
yg16.loco$ideo5 <- yg16.loco$ideo5 %>% 
  factor(levels = yg16$ideo5 %>%  levels()) %>%
  as.integer()
yg16.loco$educ <- yg16.loco$educ %>% 
  factor(levels = yg16$educ %>% levels()) %>% 
  as.integer()
yg16.loco$gender <- yg16.loco$gender %>% 
  factor(levels = yg16$gender %>% levels()) %>% 
  as.integer()
yg16.loco$race <- yg16.loco$race %>% 
  factor(levels = yg16$race %>% levels()) %>% 
  as.integer()
yg16.loco$income <- yg16.loco$income %>% 
  factor(levels = unique(yg16$income %>% na.omit())) %>% 
  as.integer()
yg16.loco$pid3 <- yg16.loco$pid3 %>% 
  factor(levels = yg16$pid3 %>% levels()) %>% 
  as.integer()
yg16.loco$age <- yg16.loco$age %>% 
  factor(levels = unique(yg16$age)) %>% 
  as.integer()
yg16.loco$pol.class <- yg16.loco$pol.class %>% 
  factor(levels = yg16$pol.class %>% levels()) %>% 
  as.integer()

# Convert integer dataframe to numeric.
yg16.loco[] <- lapply(yg16.loco, as.numeric)

# Estimate LOCO predictions (one shot). 
set.seed(12345)
y <- yg16.loco$diet_mean_newspol
X <- yg16.loco[,-1]
varnames <- names(X)
my.lasso.funs <- lasso.funs(nlambda = 100, cv=TRUE, cv.rule = "1se")

fit.loco <- loco(X, y,
                 alpha=0.1, 
                 train.fun = my.lasso.funs$train,
                 predict.fun = my.lasso.funs$predict, 
                 active.fun = my.lasso.funs$active.fun)

# Capture LOCO results in plot.
mar = c(4.25,4.25,3.25,1)
w = 10
h = 8

pdf(file="output/figures/figure_5.pdf",w=w,h=h)
par(mfrow = c(3, 3))

ylim = range(fit.loco$inf.wilcox[[1]][,2:3])
J = length(fit.loco$active[[1]])
plot(c(), c(), xlim=c(1,J), ylim=ylim, xaxt="n",
     main="Figure 5. \n LOCO Analysis from Lasso + CV Model \n (Guess 2021 Models/Data)",
     xlab="Variable", ylab="Confidence interval")
axis(side=1, at=1:J, labels=FALSE)
for (j in 1:J) {
  axis(side=1, at=j, labels = varnames[fit.loco$active[[1]][j]], cex.axis=0.75,
       line=0.5*j%%2)
}
abline(h=0)
segments(1:J, fit.loco$inf.wilcox[[1]][,2],
         1:J, fit.loco$inf.wilcox[[1]][,3],
         col="red", lwd=2)

# With forward stepwise. 
my.step.funs <- lars.funs(type = "step", max.steps = 50, cv = TRUE, cv.rule = "1se")
fit.steploco <- loco(X, y, 
                     alpha = 0.1, 
                     train.fun = my.step.funs$train,
                     predict.fun = my.step.funs$predict,
                     active.fun = my.step.funs$active.fun)

# Capture stepwise LOCO results in plot. 
ylim = range(fit.steploco$inf.wilcox[[1]][,2:3])
J = length(fit.steploco$active[[1]])
plot(c(), c(), xlim=c(1,J), ylim=ylim, xaxt="n",
     main="LOCO Analysis from Stepwise + CV Model \n (Guess 2021 Models/Data)",
     xlab="Variable", ylab="Confidence interval")
axis(side=1, at=1:J, labels=FALSE)
for (j in 1:J) {
  axis(side=1, at=j, labels = varnames[fit.steploco$active[[1]][j]], cex.axis=0.75,
       line=0.5*j%%2)
}
abline(h=0)
segments(1:J, fit.steploco$inf.wilcox[[1]][,2],
         1:J, fit.steploco$inf.wilcox[[1]][,3],
         col="red", lwd=2)

# With random forests. 
my.rf.funs <- rf.funs(ntree = 500)
fit.rfloco <- loco(X, y, 
                   alpha = 0.1,
                   train.fun = my.rf.funs$train,
                   predict.fun = my.rf.funs$predict,
                   active.fun = my.rf.funs$active)

# Capture randomforest LOCO results in plot. 
ylim = range(fit.rfloco$inf.wilcox[[1]][,2:3])
J = length(fit.rfloco$active[[1]])
plot(c(), c(), xlim=c(1,J), ylim=ylim, xaxt="n",
     main="LOCO Analysis from Random Forest Model \n (Guess 2021 Models/Data)",
     xlab="Variable", ylab="Confidence interval")
axis(side=1, at=1:J, labels=FALSE)
for (j in 1:J) {
  axis(side=1, at=j, labels = varnames[fit.rfloco$active[[1]][j]], cex.axis=0.75,
       line=0.5*j%%2)
}
abline(h=0)
segments(1:J, fit.rfloco$inf.wilcox[[1]][,2],
         1:J, fit.rfloco$inf.wilcox[[1]][,3],
         col="red", lwd=2)

# -------------------------------------------------------------------------

# Sequester and reshape Tyler, Grimmer, & Iyengar (2021) data for LOCO prediction 
# models. 
gdf.loco <- gdf %>% 
  select(mean_align,
         Ideology_Numeric,
         Age,
         Race,
         Female,
         Income,
         pid,
         newsint) %>% 
  drop_na()

# Convert factors to integers. 
gdf.loco$Age <- gdf.loco$Age %>% as.integer()
gdf.loco$Race <- gdf.loco$Race %>% as.integer()
gdf.loco$Income <- gdf.loco$Income %>% as.integer()

# Convert integer dataframe to numeric.
gdf.loco[] <- lapply(gdf.loco, as.numeric)

# Estimate LOCO predictions (one shot). 
set.seed(12345)
y <- gdf.loco$mean_align
X <- gdf.loco[,-1]
varnames <- names(X)
my.lasso.funs <- lasso.funs(nlambda = 100, cv=TRUE, cv.rule = "1se")
fit.loco <- loco(X, y,
                 alpha=0.1, 
                 train.fun = my.lasso.funs$train,
                 predict.fun = my.lasso.funs$predict, 
                 active.fun = my.lasso.funs$active.fun)

# Capture LOCO results in plot.
ylim = range(fit.loco$inf.wilcox[[1]][,2:3])
J = length(fit.loco$active[[1]])
plot(c(), c(), xlim=c(1,J), ylim=ylim, xaxt="n",
     main="Figure 6. \n LOCO Analysis from Lasso + CV Model \n (TGI 2021 Models/Data)",
     xlab="Variable", ylab="Confidence interval")
axis(side=1, at=1:J, labels=FALSE)
for (j in 1:J) {
  axis(side=1, at=j, labels=varnames[fit.loco$active[[1]][j]], cex.axis=0.75,
       line=0.5*j%%2)
}
abline(h=0)
segments(1:J, fit.loco$inf.wilcox[[1]][,2],
         1:J, fit.loco$inf.wilcox[[1]][,3],
         col="red", lwd=2)

# With forward stepwise. 
my.step.funs <- lars.funs(type = "step", max.steps = 50, cv = TRUE, cv.rule = "1se")
fit.steploco <- loco(X, y, 
                     alpha = 0.1, 
                     train.fun = my.step.funs$train,
                     predict.fun = my.step.funs$predict,
                     active.fun = my.step.funs$active.fun)

# Capture stepwise LOCO results in plot. 
ylim = range(fit.steploco$inf.wilcox[[1]][,2:3])
J = length(fit.steploco$active[[1]])
plot(c(), c(), xlim=c(1,J), ylim=ylim, xaxt="n",
     main="LOCO Analysis from Stepwise + CV Model \n (TGI 2021 Models/Data)",
     xlab="Variable", ylab="Confidence interval")
axis(side=1, at=1:J, labels=FALSE)
for (j in 1:J) {
  axis(side=1, at=j, labels=varnames[fit.steploco$active[[1]][j]], cex.axis=0.75,
       line=0.5*j%%2)
}
abline(h=0)
segments(1:J, fit.steploco$inf.wilcox[[1]][,2],
         1:J, fit.steploco$inf.wilcox[[1]][,3],
         col="red", lwd=2)

# With random forests. 
my.rf.funs <- rf.funs(ntree = 500)
fit.rfloco <- loco(X, y, 
                   alpha = 0.1,
                   train.fun = my.rf.funs$train,
                   predict.fun = my.rf.funs$predict,
                   active.fun = my.rf.funs$active)

# Capture randomforest LOCO results in plot. 
ylim = range(fit.rfloco$inf.wilcox[[1]][,2:3])
J = length(fit.rfloco$active[[1]])
plot(c(), c(), xlim=c(1,J), ylim=ylim, xaxt="n",
     main="LOCO Analysis from Random Forest Model \n (TGI 2021 Models/Data)",
     xlab="Variable", ylab="Confidence interval")
axis(side=1, at=1:J, labels=FALSE)
for (j in 1:J) {
  axis(side=1, at=j, labels=varnames[fit.rfloco$active[[1]][j]], cex.axis=0.75,
       line=0.5*j%%2)
}
abline(h=0)
segments(1:J, fit.rfloco$inf.wilcox[[1]][,2],
         1:J, fit.rfloco$inf.wilcox[[1]][,3],
         col="red", lwd=2)

# -------------------------------------------------------------------------
hnews.loco <- hnews %>% 
  select(mean_align,
         ideo5, 
         educ,
         gender,
         race,
         age,
         income,
         pid3) %>% 
  drop_na() 

# Convert factors to integers.
hnews.loco$ideo5 <- hnews.loco$ideo5 %>% 
  factor(levels = yg16$ideo5 %>% levels()) %>%
  as.integer()
hnews.loco$educ <- hnews.loco$educ %>% 
  factor(levels = yg16$educ %>% levels()) %>% 
  as.integer()
hnews.loco$gender <- hnews.loco$gender %>% 
  factor(levels = yg16$gender %>% levels()) %>% 
  as.integer()
hnews.loco$race <- hnews.loco$race %>% 
  factor(levels = yg16$race %>% levels()) %>% 
  as.integer()
hnews.loco$income <- hnews.loco$income %>% 
  factor(levels = unique(yg16$income %>% na.omit())) %>% 
  as.integer()
hnews.loco$pid3 <- hnews.loco$pid3 %>% 
  factor(levels = yg16$pid3 %>% levels()) %>% 
  as.integer()
hnews.loco$age <- hnews.loco$age %>% 
  factor(levels = unique(yg16$age)) %>% 
  as.integer()

# Convert integer dataframe to numeric.
hnews.loco[] <- lapply(hnews.loco, as.numeric)

# Estimate LOCO predictions (one shot). 
set.seed(12345)
y <- hnews.loco$mean_align
X <- hnews.loco[,-1]
varnames <- names(X)
my.lasso.funs <- lasso.funs(nlambda = 100, cv=TRUE, cv.rule = "1se")

fit.loco <- loco(X, y,
                 alpha=0.1, 
                 train.fun = my.lasso.funs$train,
                 predict.fun = my.lasso.funs$predict, 
                 active.fun = my.lasso.funs$active.fun)

# Capture LOCO results in plot.
mar = c(4.25,4.25,3.25,1)
w = 10
h = 8

ylim = range(fit.loco$inf.wilcox[[1]][,2:3])
J = length(fit.loco$active[[1]])
plot(c(), c(), xlim=c(1,J), ylim=ylim, xaxt="n",
     main="Figure 7. \n LOCO Analysis from Lasso + CV Model \n (Guess + URL Proxy 2021 Models/Data)",
     xlab="Variable", ylab="Confidence interval")
axis(side=1, at=1:J, labels=FALSE)
for (j in 1:J) {
  axis(side=1, at=j, labels=varnames[fit.loco$active[[1]][j]], cex.axis=0.75,
       line=0.5*j%%2)
}
abline(h=0)
segments(1:J, fit.loco$inf.wilcox[[1]][,2],
         1:J, fit.loco$inf.wilcox[[1]][,3],
         col="red", lwd=2)

# With forward stepwise. 
my.step.funs <- lars.funs(type = "step", max.steps = 50, cv = TRUE, cv.rule = "1se")
fit.steploco <- loco(X, y, 
                     alpha = 0.1, 
                     train.fun = my.step.funs$train,
                     predict.fun = my.step.funs$predict,
                     active.fun = my.step.funs$active.fun)

# Capture stepwise LOCO results in plot. 
ylim = range(fit.steploco$inf.wilcox[[1]][,2:3])
J = length(fit.steploco$active[[1]])
plot(c(), c(), xlim=c(1,J), ylim=ylim, xaxt="n",
     main="LOCO Analysis from Stepwise + CV Model \n (Guess + URL Proxy 2021 Models/Data)",
     xlab="Variable", ylab="Confidence interval")
axis(side=1, at=1:J, labels=FALSE)
for (j in 1:J) {
  axis(side=1, at=j, labels=varnames[fit.steploco$active[[1]][j]], cex.axis=0.75,
       line=0.5*j%%2)
}
abline(h=0)
segments(1:J, fit.steploco$inf.wilcox[[1]][,2],
         1:J, fit.steploco$inf.wilcox[[1]][,3],
         col="red", lwd=2)

# With random forests. 
my.rf.funs <- rf.funs(ntree = 500)
fit.rfloco <- loco(X, y, 
                   alpha = 0.1,
                   train.fun = my.rf.funs$train,
                   predict.fun = my.rf.funs$predict,
                   active.fun = my.rf.funs$active)

# Capture randomforest LOCO results in plot. 
ylim = range(fit.rfloco$inf.wilcox[[1]][,2:3])
J = length(fit.rfloco$active[[1]])
plot(c(), c(), xlim=c(1,J), ylim=ylim, xaxt="n",
     main="LOCO Analysis from Random Forest Model \n (Guess + URL Proxy 2021 Models/Data)",
     xlab="Variable", ylab="Confidence interval")
axis(side=1, at=1:J, labels=FALSE)
for (j in 1:J) {
  axis(side=1, at=j, labels=varnames[fit.rfloco$active[[1]][j]], cex.axis=0.75,
       line=0.5*j%%2)
}
abline(h=0)
segments(1:J, fit.rfloco$inf.wilcox[[1]][,2],
         1:J, fit.rfloco$inf.wilcox[[1]][,3],
         col="red", lwd=2)

graphics.off()

options(warn = oldw)
```
```{r, echo = FALSE, echo = F, fig.show="hold", out.width="100%", fig.align="center"}
# Output figures 6/7/8. 
knitr::include_graphics("output/figures/figure_5.pdf")
```

```{r, echo = FALSE}
# Correlations between ideology and partisanship. 
pid.15.g <- yg15$pid3 %>% as.numeric() %>% Recode("2=4;4=2") %>% na.omit()
pid.16.g <- yg16$pid3 %>% as.numeric() %>% Recode("2=4;4=2") %>% na.omit()
pid.tgi <- gdf$Party %>% as.numeric() %>% Recode("NA=1;3=2;1=3;2=4") %>% na.omit()
pid.hnews <- hnews.loco$pid3 %>% Recode("2=4;4=2") %>% na.omit()
ideo.15.g <- yg15$ideo5 %>% as.numeric() %>% Recode("5=2;2=3;3=5") %>% na.omit()
ideo.16.g <- yg16$ideo5 %>% as.numeric() %>% Recode("5=2;2=3;3=5") %>% na.omit()
ideo.tgi <- gdf$Ideology %>% as.numeric() %>% Recode("NA=1;5=2;2=3;3=5") %>% na.omit()
ideo.hnews <- hnews.loco$ideo5 %>% Recode("5=2;2=3;3=5") %>% na.omit()

# Guess 2015 ideology-party ID correlation. 
cor.15.g <- cor(pid.15.g, 
                ideo.15.g)

# Guess 2016 ideology-party ID correlation. 
cor.16.g <- cor(pid.16.g,
                ideo.16.g)

# Tyler, Grimmer, & Iyengar 2016 ideology-party ID correlation. 
cor.tgi <- cor(pid.tgi,
               ideo.tgi)

# Guess (2021) + URL Proxy Method ideology-party ID correlation.
cor.hnews <- cor(pid.hnews,
                 ideo.hnews)
```

# Discussion and Conclusions

Overall, our replication study found that, when stratified by party, most Americans seem to overlap in their media consumption patterns regardless of their affiliation. However, when stratified by ideology, we find that more ideological partisan respondents exhibit much less overlap in their media consumption. The "moderate majority" or moderate partisans, non-ideological individuals, or people with no issue content whatsoever in their beliefs may be driving findings that suggest lack of polarization in the electorate. Finally, we found that two papers making conclusions about American media diets, Guess (2021) and Tyler, Grimmer, & Iyengar (2021), came to different conclusions as a result of \textit{both} different URL classification methods \textit{and} differences in the underlying survey sample. 

There were several limitations to our replication. The first is that we were unable to obtain the 2016 data in Guess (2021) due to privacy restrictions. The second is that all of the data in each year for the two papers come from different surveys with nebulous and potentially differing question wordings and sampling designs. We also did not have a full suite of data between the years 2015-2020 that we could use to determine whether long-term trends in media diet slants point towards ideological extremification in the American mass public.  

Another question that this work leaves open is, who are the people who allow their web browsing data to be tracked? Guess (2021) states that the respondents tended to be younger and more Democratic than a representative sample of Americans. Tyler, Grimmer, and Iyengar (2021) note that online panel participants tended to be more active Internet users than the general population. But, just as previous telephone surveys find that their samples tend to have differential rates of older women as respondents, and may be correlated along an underlying psychological dimension of agreeableness, a concern may be that certain panels are including very ideologically motivated partisans that are not representative of the average American partisan. Future studies should investigate whether poststratification weights can really offset differential nonresponse and psychological-attributional groupings among the willing survey sample. 

We conclude with several recommendations for researchers who wish to study how Americans consume media online using similar research designs. The first is that study samples may need to be much larger in order to capture the full heterogeneity of Americans' web browsing behavior. The second is that more background characteristics should be collected in these surveys. Party identification and ideology should be measured by a 7-point scale wherein independents and moderates are pressed to provide their party and ideological leanings so as to facilitate greater specificity and precision in model estimation. Further, surveys should include questions asking about political interest and about how many days during the average week do respondents engage with TV/magazine/internet news in order to triangulate stated versus realized observational patterns. Third, researchers should pay more attention to the composition of the samples in the panel or other survey from which they obtain browsing data; it is unclear whether heterogeneities in browsing data can be rectified simply by applying posstratification weights to unrepresentative data plagued by differential nonresponse. Finally, researchers studying American media diets should ask, which variable is really more important and influential: party, ideology, or their interaction? 

\newpage 

# Bibliography 

<div id="refs"></div>

\newpage

# Appendix

## (A.1) Replication of Main Guess (2021) Model 

We replicate the original Guess (2021) media consumption prediction model below. 

\setcounter{table}{0}

```{r, echo = FALSE, results = 'asis'}
# Run models. 
model.1.2015 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + pid3,
                   data = yg15,
                   weights = weights)
model.1.2016 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + pid3,
                   data = yg16, 
                   weights = weight)

# Sequester variable names.
varnames <- str_replace(str_replace(str_replace(str_replace(str_replace(
  str_replace(attr(summary(model.1.2015)$coefficients, "dimnames")[[1]], 
              "age", "Age: "), "race", "Race: "), "educ", ""), "pid3", ""), 
  "gender", ""), "income", "Income level")

# Output table.
stargazer(model.1.2015,
          model.1.2016,
          style = "apsr",
          title = "\\textbf{Determinants of Media Diet Slant (Guess 2021)}",
          dep.var.labels = "Average media diet slant (news/politics only)",
          column.labels = c("\\emph{2015}", "\\emph{2016}"),
          se = starprep(model.1.2015, model.1.2016),
          p = starprep(model.1.2015, model.1.2016, stat = "p.value"),
          covariate.labels = varnames[2:length(varnames)],
          omit.stat = c("f", "ser", "rsq"),
          column.sep.width = "0pt",
          star.cutoffs = c(.05, .01),
          align = TRUE,
          notes = c("OLS regressions with HC2 robust standard errors in",
                    "parentheses; YouGov survey data with weights applied."),
          notes.append = TRUE,
          header = FALSE,
          float = TRUE,
          table.placement = "H",
          font.size = "footnotesize")
```

## (A.2) Full Results of Ideology-Extended Models of Media Consumption

We present a full table of the extension of the Guess (2021) media consumption prediction model by varying the inclusion of measures of respondent ideology.  

\setcounter{table}{2}

```{r, echo = FALSE, results = 'asis'}
# Sequester correct models. 
model.1.2015 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + pid3,
                   data = yg15,
                   weights = weights)
model.2.2015 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + ideo5,
                   data = yg15,
                   weights = weights)
model.3.2015 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + pid3 +
                     ideo5,
                   data = yg15,
                   weights = weights)
model.1.2016 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + pid3,
                   data = yg16, 
                   weights = weight)
model.2.2016 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + ideo5,
                   data = yg16, 
                   weights = weight)
model.3.2016 <- lm(diet_mean_newspol ~ age + race + gender + income + educ + pid3 +
                   ideo5,
                   data = yg16, 
                   weights = weight)

varnames <- str_replace(str_replace(str_replace(str_replace(str_replace(str_replace(
  str_replace(attr(summary(model.3.2015)$coefficients, "dimnames")[[1]], 
              "age", "Age: "), "race", "Race: "), "educ", ""), "pid3", ""), 
  "gender", ""), "income", "Income level"), "ideo5", "")

# Output model results in stargazer table. 
stargazer(model.1.2015,
          model.2.2015,
          model.3.2015,
          model.1.2016,
          model.2.2016,
          model.3.2016,
          style = "apsr", 
          title = "\\textbf{Determinants of Media Diet Slant (Including Ideology)}",
          dep.var.labels = "Average media diet slant (news/politics only)", 
          column.labels = c("\\emph{2015} (PID)", 
                            "\\emph{2015} (Ideo)",
                            "\\emph{2015} (Combined)",
                            "\\emph{2016} (PID)",
                            "\\emph{2016} (Ideo)",
                            "\\emph{2016} (Combined)"), 
          se = starprep(model.1.2015, model.2.2015, model.3.2015,
                        model.1.2016, model.2.2016, model.3.2016),
          p = starprep(model.1.2015, model.2.2015, model.3.2015,
                       model.1.2016, model.2.2016, model.3.2016,
                       stat = "p.value"),
          covariate.labels = varnames[2:length(varnames)],
          omit.stat = c("f", "ser", "rsq"), column.sep.width = "0pt", 
          star.cutoffs = c(.05, .01),
          align = TRUE,
          notes = c("OLS regressions with HC2 robust standard errors in parentheses; YouGov survey data with weights applied."),
          add.lines = list("",
                           c("Party ID", "\\checkmark", "", "\\checkmark", 
                             "\\checkmark", "", "\\checkmark"),
                           c("Ideology", "", "\\checkmark", "\\checkmark", 
                             "", "\\checkmark", "\\checkmark")),
          notes.append = TRUE,
          header = FALSE,
          float = TRUE,
          table.placement = "H",
          font.size = "scriptsize")
```

## (A.3) Tabulation of Respondent Ideological Types Using Guess (2021) Data

```{r, echo = FALSE, results = 'asis'}
# Output tabulations of respondent ideological classes. 
ideo.dat <- matrix(data = c(pc.2015 %>% as.numeric(), 
                            pc.2016 %>% as.numeric()),
                   nrow = 2,
                   byrow = TRUE, 
                   dimnames = list(c(2015, 2016),
                                   pc.2015 %>% names()))
kable(ideo.dat,
      align = rep('c', 7),
      format = "latex",
      booktabs = T,
      escape = FALSE,
      caption = "\\textbf{Proportion of Respondents in Guess (2021) by Ideological Type}") %>% 
      kable_styling(position = "center", 
                    font_size = 7,
                    latex_options = "HOLD_position")
```

\newpage

## (A.4) Replication of Tyler, Grimmer, & Iyengar (2021) Results

We replicate the main media diet distributions by partisanship figure using the data from Tyler, Grimmer, & Iyengar (2021) as well as their URL section name proxy method for classifying whether URLs are news or not news in \textbf{Figure 8}. \textbf{Table 5} displays five different model specifications that include party identification, ideology, both, both and a political interest variable, and both with a political interest and party identification interaction. Finally, \textbf{Figure 9} shows measures of model dependence in the Tyler, Grimmer, & Iyengar (2021) models and data. The model containing only party identification and the model that contains party identification, ideology, and the measure of political interest appear to have the lowest model dependence, findings that diverge significantly from the data and models in Guess (2021)---that, respectively, are improved dramatically by the model that includes only ideology---and were proven to be a likely artefact of the survey sampling process and differential classification methodology.  

```{r, echo = FALSE, warning = FALSE, message = FALSE}
# Figure 8. Americans' Online Media Diets (URL Section Proxy Method; 2016)

# Compute plots and  overlap coefficients. 
# TGI 2016. 
tgi.16 <- news %>% filter(!is.na(avg_align)) 
tgi.16$party <- recode(tgi.16$party, "dem" = "Democrat", "rep" = "Republican", "ind" = "Independent") 
tgi.16 <- tgi.16 %>%
  group_by(caseid, party) %>%
  dplyr::summarize(mean_align = mean(avg_align)) %>% 
  ungroup()

rep <- tgi.16 %>% filter(party == "Republican") %>% pull(mean_align)
dem <- tgi.16 %>% filter(party == "Democrat") %>% pull(mean_align)
ind <- tgi.16 %>% filter(party == "Independent") %>% pull(mean_align)
tgi.16.dr <- overlap(dem, rep)
tgi.16.di <- overlap(dem, ind)
tgi.16.ri <- overlap(rep, ind)

plt.tgi.16 <- tgi.16 %>% 
  filter(!is.na(party)) %>% 
  ggplot(aes(mean_align, fill = party, color = party)) + 
  geom_density(alpha = 0.5) +
  annotate(geom = "text", x = 0.75, y = 0.5, 
           label = "Independents", size = 3, color = "black") +
  annotate(geom = "text", x = 0.25, y = 1.75, 
           label = "Republicans", size = 3, color = "firebrick1") +
  annotate(geom = "text", x = -0.8, y = 1, 
           label = "Democrats", size = 3, color = "dodgerblue4") +
  scale_fill_grey() + 
  scale_color_manual(values = c("dodgerblue4",
                                "firebrick1",
                                "black")) +
  theme_classic() + 
  ggtitle("Figure 8. Americans' Online Media Diets (URL Proxy)") +
  labs(x = "Average media diet slant",
       y = "Density") +
  lims(x = c(-1, 1)) +
  theme(legend.position = "none",
        plot.title = element_text(size = 14, face = "bold"))
ggsave("output/figures/figure_8.pdf", device = "pdf", width = 6, height = 4)
```

```{r, echo = F, fig.show="hold", out.width="75%", fig.align="center"}
# Load Figure 4 into the RMarkdown document. 
knitr::include_graphics("output/figures/figure_8.pdf")
```

```{r, echo = FALSE, warning = FALSE, message = FALSE}
# Compute K-S tests and bootstrapped KS-tests of distributional equality. 
ks.overall <- ks.test(gdf$mean_align,
                      yg16$diet_mean_newspol) # Overall Distributions
ks.overall.boot <- ks.boot(gdf$mean_align,
                           yg16$diet_mean_newspol)
ks.dem <- ks.test(gdf$mean_align[gdf$Party == "Party: Democrat"],
        yg16$diet_mean_newspol[yg16$pid3 == "Democrat"]) # Democrat Distributions
ks.dem.boot <- ks.boot(gdf$mean_align[gdf$Party == "Party: Democrat"],
        yg16$diet_mean_newspol[yg16$pid3 == "Democrat"])

ks.ind <- ks.test(gdf$mean_align[gdf$Party == "Party: Independent"],
        yg16$diet_mean_newspol[yg16$pid3 == "Independent"]) # Independent Distributions
ks.ind.boot <- ks.boot(gdf$mean_align[gdf$Party == "Party: Independent"],
        yg16$diet_mean_newspol[yg16$pid3 == "Independent"])

ks.rep <- ks.test(gdf$mean_align[gdf$Party == "Party: Republican"],
        yg16$diet_mean_newspol[yg16$pid3 == "Republican"]) # Republican Distributions
ks.rep.boot <- ks.boot(gdf$mean_align[gdf$Party == "Party: Republican"],
        yg16$diet_mean_newspol[yg16$pid3 == "Republican"])
```

```{r, echo = FALSE, results = 'asis'}
# Table 5: Regressions Using URL Section Proxy Data. 
gdf$newsintxpid <- interaction(gdf$Party, gdf$newsint)
greg <- lm(mean_align ~ Age + Race + Female + Income_Numeric + Educ + Party,
           data = gdf, 
           weights = gdf$weight)

greg2 <- lm(mean_align ~ Age + Race + Female + Income_Numeric + Educ + 
              Ideology,
           data = gdf, 
           weights = gdf$weight)

greg3 <- lm(mean_align ~ Age + Race + Female + Income_Numeric + Educ + Party + 
              Ideology,
           data = gdf, 
           weights = gdf$weight)
greg4 <- lm(mean_align ~ Age + Race + Female + Income_Numeric + Educ + Party + 
              Ideology + newsint,
           data = gdf, 
           weights = gdf$weight)

greg5 <- lm(mean_align ~ Age + Race + Female + Income_Numeric +
            Educ + Party + Ideology + newsint + newsintxpid,
            data = gdf, 
            weights = gdf$weight)

varnames <- c("Age: 30-44",
              "Age: 45-59",
              "Age: 60+",
              "Race: White",
              "Race: Black",
              "Race: Hispanic",
              "Female",
              "Income level",
              "High school",
              "Some college",
              "College graduate",
              "Postgraduate",
              "Democrat",
              "Republican",
              "Conservative",
              "Liberal",
              "Moderate",
              "Very conservative",
              "Very liberal",
              "Political Interest",
              "Democrat x Interest",
              "Republican x Interest")

# Output model results in stargazer table. 
stargazer(greg,
          greg2,
          greg3,
          greg4,
          greg5,
          style = "apsr", 
          title = "\\textbf{Determinants of Media Diet Slant (Using URL Section Proxies)}",
          dep.var.labels = "Average media diet slant (news/politics only)", 
          column.labels = c("(PID)", 
                            "(Ideo)",
                            "(Combined)",
                            "(Interest)", 
                            "(Party x Interest)"), 
          se = starprep(greg, greg2, greg3, greg4),
          p = starprep(greg, greg2, greg3, greg4, stat = "p.value"),
          omit = c("newsintxpidParty: Independent.1",
                   "newsintxpidParty: Democrat.1",  
                   "newsintxpidParty: Republican.1"),
          covariate.labels = varnames,
          omit.stat = c("f", "ser", "rsq"), column.sep.width = "0pt", 
          star.cutoffs = c(.05, .01),
          align = TRUE,
          notes = c("OLS regressions with HC2 robust standard errors in parentheses; YouGov survey data with weights applied."),
          add.lines = list("",
                           c("Party ID", "\\checkmark", "", "\\checkmark", "\\checkmark", "\\checkmark"),
                           c("Ideology", "", "\\checkmark", "\\checkmark", "\\checkmark", "\\checkmark"),
                           c("Political Interest", "", "", "", "\\checkmark", "\\checkmark")),
          notes.append = TRUE,
          header = FALSE,
          float = TRUE,
          table.placement = "H",
          font.size = "scriptsize")

gglm <- glm(mean_align ~ Age + Race + Female + Income_Numeric + Educ + Party,
            data = gdf, 
            weights = gdf$weight)

gglm2 <- glm(mean_align ~ Age + Race + Female + Income_Numeric + Educ + 
              Ideology,
             data = gdf, 
             weights = gdf$weight)

gglm3 <- glm(mean_align ~ Age + Race + Female + Income_Numeric + Educ + Party + 
              Ideology,
             data = gdf, 
             weights = gdf$weight)

gglm4 <- glm(mean_align ~ Age + Race + Female + Income_Numeric + Educ + Party + 
              Ideology + newsint,
             data = gdf, 
             weights = gdf$weight)

gglm5 <- glm(mean_align ~ Age + Race + Female + Income_Numeric +
             Educ + Party + Ideology + Party * newsint,
             data = gdf, 
             weights = gdf$weight)
greg5.2 <- lm(mean_align ~ Age + Race + Female + Income_Numeric +
             Educ + Party + Ideology + Party * newsint,
             data = gdf, 
             weights = gdf$weight)
```

```{r, echo = FALSE}
# MAUD/SMAUD/GIM Tests on New Data. 

# Compute MAUDs and SMAUDs for various model types. 
maud.tgi.1 <- maud.gen(starprep(greg)[[1]], 
                       summary(greg)$coefficients[,2])
maud.tgi.2 <- maud.gen(starprep(greg2)[[1]],
                       summary(greg2)$coefficients[,2])
maud.tgi.3 <- maud.gen(starprep(greg3)[[1]], 
                       summary(greg3)$coefficients[,2])
maud.tgi.4 <- maud.gen(starprep(greg4)[[1]],
                       summary(greg4)$coefficients[,2])
maud.tgi.5 <- maud.gen(starprep(greg5.2)[[1]],
                       summary(greg5.2)$coefficients[,2])
smaud.tgi.1 <- maud.gen(starprep(greg)[[1]], 
                        summary(greg)$coefficients[,2], 1)
smaud.tgi.2 <- maud.gen(starprep(greg2)[[1]],
                        summary(greg2)$coefficients[,2], 1)
smaud.tgi.3 <- maud.gen(starprep(greg3)[[1]], 
                        summary(greg3)$coefficients[,2], 1)
smaud.tgi.4 <- maud.gen(starprep(greg4)[[1]],
                        summary(greg4)$coefficients[,2], 1)
smaud.tgi.5 <- maud.gen(starprep(greg5.2)[[1]],
                        summary(greg5.2)$coefficients[,2], 1)
mauds.tgi <- c(maud.tgi.1, maud.tgi.2, maud.tgi.3,
               maud.tgi.4, maud.tgi.5)
smauds.tgi <- c(smaud.tgi.1, smaud.tgi.2, smaud.tgi.3,
                smaud.tgi.4, smaud.tgi.5)
avg.guess <- c(mean(mauds),
               mean(smauds))
avg.tgi <- c(mean(mauds.tgi),
             mean(smauds.tgi))
```

```{r, echo = FALSE, eval = FALSE}
# Compute GIM test and save (quite computationally intensive).
GIM.tgi.1 <- GIM(gglm, full = TRUE, B = 30, B2 = 25)
GIM.tgi.2 <- GIM(gglm2, full = TRUE, B = 30, B2 = 25)
GIM.tgi.3 <- GIM(gglm3, full = TRUE, B = 30, B2 = 25)
GIM.tgi.4 <- GIM(gglm4, full = TRUE, B = 30, B2 = 25)
GIM.tgi.5 <- GIM(gglm5, full = TRUE, B = 30, B2 = 25)
saveRDS(GIM.tgi.1, file = "input/TGI_Data/GIM_TGI_1.RData")
saveRDS(GIM.tgi.2, file = "input/TGI_Data/GIM_TGI_2.RData")
saveRDS(GIM.tgi.3, file = "input/TGI_Data/GIM_TGI_3.RData")
saveRDS(GIM.tgi.4, file = "input/TGI_Data/GIM_TGI_4.RData")
saveRDS(GIM.tgi.5, file = "input/TGI_Data/GIM_TGI_5.RData")
```

```{r, echo = FALSE}
# Figure 10. MAUD/SMAUD/GIM Tests on New Data. 

# Load pre-computed GIM Tests. 
GIM.tgi.1 <- readRDS("input/TGI_Data/GIM_TGI_1.RData")
GIM.tgi.2 <- readRDS("input/TGI_Data/GIM_TGI_2.RData")
GIM.tgi.3 <- readRDS("input/TGI_Data/GIM_TGI_3.RData")
GIM.tgi.4 <- readRDS("input/TGI_Data/GIM_TGI_4.RData")
GIM.tgi.5 <- readRDS("input/TGI_Data/GIM_TGI_5.RData")

# MAUD/SMAUD subfigures. 
unc.dat <- data.frame(maud = mauds.tgi, 
                      smaud = smauds.tgi, 
                      rot = c(GIM.tgi.1$`Rule of Thumb`,
                             GIM.tgi.2$`Rule of Thumb`,
                             GIM.tgi.3$`Rule of Thumb`,
                             GIM.tgi.4$`Rule of Thumb`,
                             GIM.tgi.5$`Rule of Thumb`),
                      tstat = c(GIM.tgi.1$`GIM test statistic`,
                                GIM.tgi.2$`GIM test statistic`,
                                GIM.tgi.3$`GIM test statistic`,
                                GIM.tgi.4$`GIM test statistic`,
                                GIM.tgi.5$`GIM test statistic`),
                      model = c("PID", "Ideo", "Combined", "Interest", "PID x Interest"))

m.plt <- unc.dat %>% 
  ggplot(aes(x = maud, y = model, label = model)) +
  geom_point(size = 3, shape = 24, fill = "blue", color = "darkred") +
  geom_vline(xintercept = 0, lty = "dotted") + 
  labs(x = "MAUD",
       y = "Model", 
       title = "MAUDs by Model Specification") + 
  theme_pubr()

sm.plt <- unc.dat %>% 
  ggplot(aes(x = smaud, y = model, label = model)) +
  geom_point(size = 3, shape = 23, fill = "blue", color = "darkred") +
  geom_vline(xintercept = 0, lty = "dotted") + 
  labs(x = "SMAUD",
       y = "Model", 
       title = "SMAUDs by Model Specification") + 
  theme_pubr()
  
# GIM statistics subfigures. 
rot.plt <- unc.dat %>% 
  ggplot(aes(x = rot, y = model, label = model)) +
  geom_point(size = 3, shape = 22, fill = "blue", color = "darkred") +
  geom_vline(xintercept = 0, lty = "dotted") + 
  labs(x = "(GIM) Rule of Thumb",
       y = "Model", 
       title = "(GIM) Rules of Thumb by Model Specification") + 
  theme_pubr()

tstat.plt <- unc.dat %>% 
  ggplot(aes(x = rot, y = model, label = model)) +
  geom_point(size = 3, shape = 21, fill = "blue", color = "darkred") +
  geom_vline(xintercept = 0, lty = "dotted") + 
  labs(x = "(GIM) Rule of Thumb",
       y = "Model", 
       title = "(GIM) Test Statistics by Model Specification") + 
  theme_pubr()

# Arrange, save, and output plot. 
plt.2 <- ggarrange(m.plt,
                   sm.plt,
                   rot.plt,
                   tstat.plt,
                   nrow = 4) %>% 
  annotate_figure(top = text_grob("Figure 9. Model Dependence by Specification (URL Proxies)",
                  face = "bold",
                  size = 14))
ggsave("output/figures/figure_9.pdf", device = "pdf", width = 8, height = 10)
```
```{r, echo = F, fig.show="hold", out.width="100%", fig.align="center"}
# Load Figure 5 into the RMarkdown document. 
knitr::include_graphics("output/figures/figure_9.pdf")
```

## (A.5) Model Dependence in Guess (2021) Data + URL Section Name Proxy Method 

We find model dependence results in \textbf{Figure 10} that differ yet again using data from Guess (2021) alongside the URL section name proxy classification method from Tyler, Grimmer, & Iyengar (2021). 

```{r, echo = FALSE}
# Analysis of Guess (2021) 2015 data classified using TGI (2021)'s URL
# section-name proxy methodology. 
hgreg.pid <- lm(mean_align ~ age + race + gender + income + 
            educ + pid3, 
            data = hnews, 
            weights = weights) 
hgreg.ideo <- lm(mean_align ~ age + race + gender + income + 
            educ + ideo5, 
            data = hnews, 
            weights = weights) 
hgreg.combo <- lm(mean_align ~ age + race + gender + income + 
            educ + pid3 + ideo5, 
            data = hnews, 
            weights = weights) 
hgglm.pid <- glm(mean_align ~ age + race + gender + income + 
            educ + pid3, 
            data = hnews, 
            weights = weights) 
hgglm.ideo <- glm(mean_align ~ age + race + gender + income + 
            educ + ideo5, 
            data = hnews, 
            weights = weights) 
hgglm.combo <- glm(mean_align ~ age + race + gender + income + 
            educ + pid3 + ideo5, 
            data = hnews, 
            weights = weights)  
```

```{r, echo = FALSE}
# Compute MAUDs and SMAUDs for various model types. 
maud.hgreg.pid <- maud.gen(starprep(model.2.2015)[[1]], 
                        summary(model.2.2015)$coefficients[,2])
maud.hgreg.ideo <- maud.gen(starprep(model.3.2015)[[1]],
                        summary(model.3.2015)$coefficients[,2])
maud.hgreg.combo <- maud.gen(starprep(model.2.2016)[[1]], 
                        summary(model.2.2016)$coefficients[,2])
smaud.hgreg.pid <- maud.gen(starprep(model.2.2015)[[1]], 
                         summary(model.2.2015)$coefficients[,2], 1)
smaud.hgreg.ideo <- maud.gen(starprep(model.3.2015)[[1]],
                         summary(model.3.2015)$coefficients[,2], 1)
smaud.hgreg.combo <- maud.gen(starprep(model.2.2016)[[1]], 
                         summary(model.2.2016)$coefficients[,2], 1)
mauds <- c(maud.hgreg.pid, maud.hgreg.ideo, maud.hgreg.combo)
smauds <- c(smaud.hgreg.pid, smaud.hgreg.ideo, smaud.hgreg.combo)
```

```{r, echo = FALSE, eval=FALSE}
# Compute GIM tests for above models. 
GIM.hgglm.pid <- GIM(hgglm.pid, full = TRUE, B = 30, B2 = 25)
GIM.hgglm.ideo <- GIM(hgglm.ideo, full = TRUE, B = 30, B2 = 25)
GIM.hgglm.combo <- GIM(hgglm.combo, full = TRUE, B = 30, B2 = 25)
```

```{r, echo = FALSE}
# Figure 10

# Load tediously computed GIM test objects.  
GIM.hgglm.pid <- readRDS("input/GIM/GIM_pid.RData")
GIM.hgglm.ideo <- readRDS("input/GIM/GIM_ideo.RData")
GIM.hgglm.combo <- readRDS("input/GIM/GIM_combo.RData")

# MAUD/SMAUD subfigures. 
unc.dat <- data.frame(maud = mauds, 
                      smaud = smauds, 
                      rot = c(GIM.hgglm.pid$`Rule of Thumb`,
                             GIM.hgglm.ideo$`Rule of Thumb`,
                             GIM.hgglm.combo$`Rule of Thumb`),
                      tstat = c(GIM.hgglm.pid$`GIM test statistic`,
                                GIM.hgglm.ideo$`GIM test statistic`,
                                GIM.hgglm.combo$`GIM test statistic`),
                      model = c("PID", "Ideo", "Combo"))

m.plt <- unc.dat %>% 
  ggplot(aes(x = maud, y = model, label = model)) +
  geom_point(size = 3, shape = 24, fill = "blue", color = "darkred") +
  geom_vline(xintercept = 0, lty = "dotted") + 
  labs(x = "MAUD",
       y = "Model", 
       title = "MAUDs by Model Specification") + 
  theme_pubr()

sm.plt <- unc.dat %>% 
  ggplot(aes(x = smaud, y = model, label = model)) +
  geom_point(size = 3, shape = 23, fill = "blue", color = "darkred") +
  geom_vline(xintercept = 0, lty = "dotted") + 
  labs(x = "SMAUD",
       y = "Model", 
       title = "SMAUDs by Model Specification") + 
  theme_pubr()
  
# GIM statistics subfigures. 
rot.plt <- unc.dat %>% 
  ggplot(aes(x = rot, y = model, label = model)) +
  geom_point(size = 3, shape = 22, fill = "blue", color = "darkred") +
  geom_vline(xintercept = 0, lty = "dotted") + 
  labs(x = "(GIM) Rule of Thumb",
       y = "Model", 
       title = "(GIM) Rules of Thumb by Model Specification") + 
  theme_pubr()

tstat.plt <- unc.dat %>% 
  ggplot(aes(x = rot, y = model, label = model)) +
  geom_point(size = 3, shape = 21, fill = "blue", color = "darkred") +
  geom_vline(xintercept = 0, lty = "dotted") + 
  labs(x = "(GIM) Rule of Thumb",
       y = "Model", 
       title = "(GIM) Test Statistics by Model Specification") + 
  theme_pubr()

# Arrange, save, and output plot. 
plt.2 <- ggarrange(m.plt,
                   sm.plt,
                   rot.plt,
                   tstat.plt,
                   nrow = 4) %>% 
  annotate_figure(top = text_grob("Figure 10. Model Dependence in URL Proxy Models",
                  face = "bold",
                  size = 14))
ggsave("output/figures/figure_10.pdf", device = "pdf", width = 6, height = 8)
```

```{r, echo = F, fig.show="hold", out.width="75%", fig.align="center"}
# Load Figure 1 into the RMarkdown document. 
knitr::include_graphics("output/figures/figure_10.pdf")
```

## (A.6) Leave-One-Covariate-Out (LOCO) Inference

Leave-one-covariate-out (LOCO) inference was first proposed in Lei et al. (2018) and generates assessments of variable importance via conformal prediction bands. 

The first step in conducting LOCO inference is to select some machine learning algorithm, $f$, to use for predicting the outcome variable. Then, split the sample in half into $D_1$ and $D_2$ such that $D_1 \cup D_2 = 1,\ldots,n$. The algorithm will then estimate $\hat{f}_{n_1}$ on the first sample, $D_1$. Next, we select variable $j$ and estimate $\hat{f}_{n_1}^{-j}$, the prediction of the outcome variable on a model 

Finally, $D_2$ is used to construct a finite-sample, distribution-free confidence interval for $$\theta_j(D_1) = \text{med}\left(|Y - \hat{f}_{n_1}^{-j}(X)| - |Y - \hat{f}_{n_1}(X)|\bigg|D_1\right)\,\,.$$ 

The beauty of this methodology is twofold. First, it can generate a highly interpretable and intuitive measure of relative importance of your explanatory variables in predicting the outcome variable. Second, you can employ any machine learning algorithm for estimating the conditional expectation function, $f(x) = \mathbb{E}(Y|X=x)$.

We use three machine learning algorithms with cross-validation in our study: LASSO (least absolute shrinking and selection operator), stepwise regression, and random forests. LASSO works similarly to the classical ordinary least squares estimator, but with a constraint imposed as a sublevel set of the $\ell_1$ norm so as to "shrink" certain covariates that are less important towards or to 0. Stepwise regression generates $2^j$ potential models with $j$ covariates and uses either forward selection or backward elimination to fit the optimal model. Stepwise regression is a somewhat antiquated method, and many critiques of it have been generated, particularly as model selection is path-dependent and selecting on $p$-values may not be an optimal objective function. Random forests are advantageous in accounting for potential nonlinearities and work by bagging an ensemble of random trees (CART).

For more details on these specific machine learning algorithms, please see Hastie, Tibshirani, & Friedman (2008).



